is completely equivalent to a standard MPI_Recv().
A potential problem with nonblocking MPI is that a compiler has no way to know
that MPI_Wait() can (and usually will) modify the contents of buf. Hence, in the
following code, the compiler may consider it legal to move the final statement in line
3 before the call to MPI_Wait():
1 call MPI_Irecv(buf, ..., request, ...)
2 call MPI_Wait(request, status, ...)
3 buf(1) = buf(1) + 1
This would certainly lead to a race condition, and the contents of buf may end up wrong.
The inherent connection between the MPI_Irecv() and MPI_Wait() calls, mediated
by the request handle, is invisible to the compiler, and the fact that buf does not
appear in the argument list of MPI_Wait() is reason enough for the compiler to assume
that the code modification is legal. A simple way to avoid this situation is to put the
variable (or buffer) into a COMMON block, so that potentially any subroutine may modify it. See
the MPI standard [P15] for alternatives.
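To make this concrete, here is a minimal sketch (not one of the book's listings) of an
MPI_Irecv()/MPI_Wait() pair whose receive buffer lives in a COMMON block. The program
name, the block name /rbuf/, the buffer size, and the message parameters are all made up
for the illustration, and the mpif.h bindings are assumed:

! Sketch only: buf is placed in a COMMON block, so the compiler must assume
! that any external call, including MPI_Wait(), may modify it and therefore
! cannot move the update of buf(1) above the wait.
program irecv_common_sketch
  implicit none
  include 'mpif.h'
  double precision :: buf(100)
  common /rbuf/ buf
  integer :: request, ierror, rank
  integer :: status(MPI_STATUS_SIZE)

  call MPI_Init(ierror)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierror)

  if (rank == 0) then
    call MPI_Irecv(buf, 100, MPI_DOUBLE_PRECISION, 1, 0, &
                   MPI_COMM_WORLD, request, ierror)
    ! ... useful work that does not touch buf ...
    call MPI_Wait(request, status, ierror)
    buf(1) = buf(1) + 1        ! safe: executed only after the wait completes
  else if (rank == 1) then
    buf = 0.d0
    call MPI_Send(buf, 100, MPI_DOUBLE_PRECISION, 0, 0, MPI_COMM_WORLD, ierror)
  end if

  call MPI_Finalize(ierror)
end program irecv_common_sketch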
Multiple requests can be pending at any time, which is another great advantage
of nonblocking communication. Sometimes a group of requests belongs together in
some respect, and one would like to check not one, but any one, any number, or all
of them for completion. This can be done with suitable calls that are parameterized
with an array of handles. As an example we choose the MPI_Waitall() routine:
integer :: count, requests(*)
integer :: statuses(MPI_STATUS_SIZE,*), ierror
call MPI_Waitall(count,     ! number of requests
                 requests,  ! request handle array
                 statuses,  ! statuses array (MPI_Status* in C)
                 ierror)    ! return value
This call returns only after all of the pending requests have been completed. The status
objects are then available in statuses(:,:).
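The related routine MPI_Waitany() instead completes whichever pending request happens to
finish first, which is convenient when results can be processed in arrival order. The
following is a rough sketch of that pattern, again assuming the mpif.h bindings; the
program structure and message parameters are invented for the example, and every nonroot
rank simply sends one number to rank 0:

! Sketch only: rank 0 posts all receives up front and then handles them in
! completion order with MPI_Waitany(), which returns a Fortran index 1..count.
program waitany_sketch
  implicit none
  include 'mpif.h'
  integer :: rank, nprocs, i, idx, ierror
  integer, allocatable :: requests(:)
  integer :: status(MPI_STATUS_SIZE)
  double precision, allocatable :: res(:)
  double precision :: myval, total

  call MPI_Init(ierror)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierror)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierror)
  myval = dble(rank)

  if (rank == 0) then
    allocate(requests(nprocs-1), res(nprocs-1))
    do i = 1, nprocs-1                       ! one pending receive per sender
      call MPI_Irecv(res(i), 1, MPI_DOUBLE_PRECISION, i, 0, &
                     MPI_COMM_WORLD, requests(i), ierror)
    enddo
    total = 0.d0
    do i = 1, nprocs-1
      call MPI_Waitany(nprocs-1, requests, idx, status, ierror)
      total = total + res(idx)               ! res(idx) is valid now
    enddo
    write(*,*) 'total = ', total
  else
    call MPI_Send(myval, 1, MPI_DOUBLE_PRECISION, 0, 0, MPI_COMM_WORLD, ierror)
  endif

  call MPI_Finalize(ierror)
end program waitany_sketch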
The integration example in Listing 9.3 can make use of nonblocking communication
by overlapping the local interval integration on rank 0 with receiving results
from the other ranks. Unfortunately, collectives cannot be used here because the
MPI standard employed in this book provides no nonblocking collectives (these were
added only later, with MPI-3.0). Listing 9.4 shows a possible solution. The
reduction operation has to be done manually (lines 33–35), as in the original code.
The sizes of the status and request arrays are not known at compile time, hence
they must be allocated dynamically, as must separate receive buffers for all ranks
except 0 (lines 11–13). The collection of partial results is performed with a single
MPI_Waitall() in line 32. Nothing needs to be changed on the nonroot ranks;
MPI_Send() is sufficient to communicate the partial results (line 39).
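Listing 9.4 itself is not reproduced here; purely to illustrate the structure just
described, a rough sketch could look as follows. The helper function integrate() and the
integrand f(x) = x*x on [0,2] are hypothetical stand-ins and do not correspond to the
book's code:

! Sketch only (not the book's Listing 9.4): rank 0 posts nonblocking receives,
! overlaps them with its own local integration, completes them with a single
! MPI_Waitall(), and sums up the partial results by hand.
program integ_sketch
  implicit none
  include 'mpif.h'
  integer :: rank, nprocs, i, ierror
  integer, allocatable :: requests(:), statuses(:,:)
  double precision, allocatable :: tmp(:)
  double precision :: res, a, b, mya, myb

  call MPI_Init(ierror)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierror)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierror)

  a = 0.d0 ; b = 2.d0                            ! whole integration domain
  mya = a + rank*(b-a)/nprocs                    ! this rank's subinterval
  myb = mya + (b-a)/nprocs

  if (rank == 0) then
    ! separate receive buffers and request/status arrays for ranks 1..nprocs-1
    allocate(tmp(nprocs-1), requests(nprocs-1), statuses(MPI_STATUS_SIZE, nprocs-1))
    do i = 1, nprocs-1
      call MPI_Irecv(tmp(i), 1, MPI_DOUBLE_PRECISION, i, 0, &
                     MPI_COMM_WORLD, requests(i), ierror)
    enddo
    res = integrate(mya, myb)                    ! local work overlaps the receives
    call MPI_Waitall(nprocs-1, requests, statuses, ierror)
    do i = 1, nprocs-1                           ! manual reduction
      res = res + tmp(i)
    enddo
    write(*,*) 'integral = ', res
  else
    res = integrate(mya, myb)
    call MPI_Send(res, 1, MPI_DOUBLE_PRECISION, 0, 0, MPI_COMM_WORLD, ierror)
  endif

  call MPI_Finalize(ierror)

contains

  double precision function integrate(lo, hi)   ! hypothetical stand-in:
    double precision, intent(in) :: lo, hi      ! midpoint rule for f(x) = x*x
    integer :: j
    integer, parameter :: n = 1000
    double precision :: h, x
    h = (hi-lo)/n
    integrate = 0.d0
    do j = 1, n
      x = lo + (j-0.5d0)*h
      integrate = integrate + x*x*h
    enddo
  end function integrate

end program integ_sketch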
Nonblocking communication provides an obvious way to overlap communication,
i.e., overhead, with useful work. The possible performance advantage, however,
depends on many factors, and may even be nonexistent (see Section 10.4.3 for a
discussion). But even if there is no real overlap, multiple outstanding nonblocking
requests may improve performance because the MPI library can decide which of
them gets serviced first.