Asynchronous Joy
Did you ever notice that Fred Flintstone was the only one who powered his car? You could see his feet running below the car, pushing it along, but no one else ever helped. This is clearly a single inertia, multiple dependents (SIMD) situation - it's almost like Fred was providing a taxi service for all the other freeloaders. Indeed, getting some of the others to help push would have resulted in greater efficiency - a multiple inertia, multiple dependents (MIMD) scenario. I call this the Flint's Taxi Economy principle.
(say it out loud)
The Story So Far
So far, we've talked about a lot of the basics of MPI. Parallel startup and short, MPI terminology, some point-to-point and collective communication, datatypes, communicators... what on earth could be next? Is there really more?
Why yes, Virginia, yes there is. What we've talked about so far can be used to write a wide variety of MPI programs. But most of the previous discussion has been about blocking calls that, for example, may or may not execute in parallel to the main computation. Specifically, we've discussed standard mode sends and receives (MPI_SEND and MPI_RECV). Most communication networks function at least an order of magnitude slower than local computations. Hence, if an MPI process has to wait for non-local communication, critical CPU cycles are lost while the operating system blocks the process, waits for the communication, and then resumes the process. Parallel applications therefore typically run at their peak efficiency when there can be a true overlap of communication and computation.
The MPI standard allows for this overlap with several forms of non-blocking communication.
Immediate (Non-Blocking) Sends
The basic non-blocking communication function is MPI_ISEND. It gets its name from being an immediate, local operation. That is, calling MPI_ISEND will return to the caller immediately, but the communication operation may be ongoing. Specifically, the caller may not re-use the buffer until the operation started with MPI_ISEND has completed (completion semantics are discussed below). In all other ways, MPI_ISEND is just like the standard-mode MPI_SEND: think of MPI_ISEND as the starting of a communication, and think of MPI_SEND as both the starting and completion of a communication (remember that MPI defines completion in terms of buffers - the operation is complete when the buffer is available for re-use, but doesn't specify anything about whether the communication has actually occurred or not).
The C binding for MPI_ISEND is:
int MPI_Isend(void *buf, int count, MPI_Datatype dtype, int dest, int tag, MPI_Comm comm, MPI_Request *req); |
You'll notice that MPI_ISEND takes the same parameters as MPI_SEND except for the last one: a pointer to an MPI_Request. The MPI implementation will fill the request with a handle that can be used to test or wait for completion of the send. More on that later.
Sidebar: Buffered Sends Are Evil |
There are three buffered send
functions: MPI_BSEND, MPI_IBSEND,
or MPI_BSEND_INIT, corresponding to standard, immediate, and
persistent modes, respectively. Unlike the other kinds of sends, these
functions are local operations. If the send cannot complete
immediately, the MPI implementation must copy the message to an
internal buffer and complete it later. Regardless, the message buffer
must be returned to the caller and be available for use.
This may sound like a great convenience - the application is free from having to perform buffer management. But it almost guarantees a performance loss; at least some MPI implementations always do a buffer copy before attempting to send the message. This adds latency to the message's travel time. Whenever possible, avoid buffered sends. |
"Immediate" also applies to MPI's other modes of sending: synchronous, buffered, and ready. Synchronous sends will not complete until a matching receive has been posted at the destination process. Hence, when a synchronous send completes, not only is the message buffer available for re-use, the application is also guaranteed that at least some communication has occurred (although it does not guarantee that the entire message has been received yet).
Buffered sends mean that the MPI implementation may copy the message to an internal buffer before sending it. See the sidebar entitled Buffered Sends Are Evil for more information on buffered mode sends. Ready sends are an optimization when it can be guaranteed that a matching receive has already been posted at the destination (specifically, it is erroneous to initiate a ready send if a matching receive has not already been posted). In such cases, the MPI implementation may be able to take shortcuts in sending protocols to reduce latency and/or increase bandwidth. Not all MPI implementation actually implement optimizations; in the worse case, a ready send is identical to a standard send.
Aside from their names, all three modes have identical bindings to MPI_ISEND: MPI_ISSEND, MPI_IBSEND, and MPI_IRSEND for immediate synchronous, buffered, and ready send, respectively.
Immediate (Non-Blocking) Receives
MPI also has an immediate receive. Just like the immediate send, the immediate receive starts a receive (or "posts" a receive). Its C binding is identical to MPI_RECV except for the last argument:
int MPI_Irecv(void *buf, int count, MPI_Datatype dtype, int src, int tag, MPI_Comm comm, MPI_Request *req); |
Just like the immediate send functions, the final argument is a pointer to an MPI_Request. Upon return from MPI_IRECV, the request can be used for testing or waiting for the receive to complete.
Who Matches What?
With all the different flavors of sending and receiving, how can you tell which send will be matched with which receive? Thankfully, it's uncomplicated. When a process receives a message, it looks for a matching receive only based on the signature of the message: the message's tag, communicator, and source and destination arguments. The sending mode (standard, synchronous, buffered, or read) and type (blocking, non-blocking, or persistent) are not used for matching incoming messages to pending receives.
Completion of Non-Blocking Communications
Once a non-blocking communication has been started (regardless of whether it is a send or a receive), the application is given an MPI_Request check for completion of the operation. Two operations are available to check for completion of a request: MPI_TEST and MPI_WAIT:
int MPI_Test(MPI_Request *req, int *flag, MPI_Status *status); int MPI_Wait(MPI_Request *req, MPI_Status *status); |
Both take pointers to an MPI_Request and an MPI_Status. The request is checked to see if the communication has completed. If it has, the status is filled in with details about the completed operation. MPI_TEST will return immediately, regardless of whether the operation has completed (the flag argument is set to 1 if it completed); it can be used for periodic polling. MPI_WAIT will block until the operation has completed.
The TEST and WAIT functions shown above check for the completion of a single request; other variations exist for checking arrays of requests (e.g., checking for completion of one in the array, one or more in the array, or all in the array): MPI_*ANY, MPI_*SOME, MPI_*ALL, respectively, where * is either TEST or WAIT.