|
Page 2 of 2
Non-Blocking Communication
Non-blocking communication is where a lot of the bookkeeping
complexity of an MPI implementation originates. The MPI "immediate"
send functions are supposed to return immediately
(e.g., MPI_ISEND). Most MPI implementations try to
send the message immediately, but may not be able to do so. That is,
the socket will be set to non-blocking mode and the implementation
will invoke write() (or writev()). The operating
system will write as many bytes as it can before it would block and
then return.
It is up to the MPI implementation to track the fact that a given
message has "claimed" the socket (i.e., no other traffic should be
written down the socket until this message has been completely
written), how many bytes have been written, how many are left to
write, etc. Complexity arises when multiple outgoing messages are
destined for the same peer - since they all have to be multiplexed
across the same socket, the MPI must queue them up and progress them
in order (potentially in a non-blocking manner - sending as many bytes
of a envelope/message as possible each time through the progression
engine).
These issues are multiplied when you consider that there are multiple
sockets and messages that must be polled for progress simultaneously
in order to provide some degree of fairness of message delivery to all
peers. The MPI implementation typically invokes poll(2)
or select(2) to see which sockets can be written. If any are
available, it tries to progress the pending send queue for each ready
file descriptor as far as possible until it would block.
Taking this into account, re-consider the above blocking example. What
if other non-blocking sends are still waiting to be sent? Unless the
MPI can handle out-of-order sending, the blocking send must progress
them all until it can send its own message. So the simplistic
pseudocode from above is likely only relevant once the blocking send
is allowed to be sent (i.e., can "claim" the socket and all
prior messages have been sent).
Receiver Logic
The receiver has non-trivial logic to implement as well. TCP allows
for partial receives - a read() call may
receive some bytes, but not all. Analogous to the
non-blocking sending logic, the MPI implementation must maintain
bookkeeping for which receiving envelopes / messages have "claimed"
the socket, how many bytes have been received so far, how many are
left to receive, etc. The same poll() or select()
used for checking write progress can also be used to check for pending
incoming data.
Once an envelope has been fully received, it must either be matched
with a previously posted receive or placed in a special queue for
"unexpected" messages. For eager sends, the body of the message must
be received as well. If the message is expected, the body can be
received directly into the target buffer. But if the message is
unexpected, a temporary buffer must be allocated to receive the
message. This temporary buffer is then associated with the envelope in
the unexpected queue.
If the message body was not sent eagerly and was expected,
the receiver must mark the matching receive as "in progress" and send
an ACK back to the sender. Consider the following valid MPI code:
Listing 2:
Overtaking non-blocking MPI traffic
1 if (i_am_sender) {
2 MPI_Isend(long_msg, ..., peer, tag, comm, &req);
3 MPI_Send(short_msg, ..., peer, tag, comm);
4 MPI_Wait(&req, &status);
5 } else {
6 MPI_Recv(msg1, ..., peer, tag, comm);
7 MPI_Recv(msg2, ..., peer, tag, comm);
8 }
|
Now consider how the traffic will actually flow across the socket:
- Sender sends envelope for the long message
- Sender sends envelope + message for the short message
- Receiver receives envelope for the long message
- Receiver sends back ACK for long message
- Receiver receives envelope + message for the short message
- Sender receives ACK
- Sender sends envelope + message for the long message
- Receiver receives envelope + message for the long message
Sidebar:
Rendezvous Protocols: Why Bother?
|
I am frequently asked why we bother using rendezvous protocols for
sending messages - they're more complicated than eager sends (where
the entire message is sent immediately) and, at a minimum, add one
round-trip latency to the overall message transference time.
The answer is: because of limited resources. Particularly in operating
system bypass networks (such as Myrinet and Infiniband), only
"special" memory can be used to send or receive messages across the
network. But only so much "special" memory is available - sending a
single, enormous message may easily consume all of it and starve other
message passing efforts.
Put even more simply - even for TCP networks - it is unattractive to
send enormous messages without the receiver's permission. Consider
sending a 32MB message eagerly. If the peer process has not posted a
matching MPI receive, the MPI implementation will have to allocate a
temporary 32MB buffer to receive it. When a matching receive is
posted, the MPI application then has to copy the entire 32MB buffer
and then de-allocate the temporary buffer.
If this situation is repeated multiple times (e.g., multiple senders
send enormous messages to a single receiver), the receiver can easily
run out of memory. For MPI implementations that cannot guarantee that
applications will not perform this way (and most cannot), rendezvous
protocols prevent resource exhaustion at the receiver.
|
This is typically called the "overtaking" problem - although the long
message was sent first, because of the use of the rendezvous protocol,
the short message payload actually arrives first. MPI's message
ordering guarantee states that the receives will match the order in
which the sends were issued. Put simply, msg1 must receive
the long message and msg2 must receive the short message
(regardless of what order they were actually sent across the
underlying network).
The receiver must do appropriate bookkeeping to ensure that this
condition is met. As a side effect of the sample code shown above, by
the time the first MPI_RECV completes, the short message will
have been received into a temporary buffer in the unexpected queue. So
when the second MPI_RECV executes, all that is required is
copying the message from the temporary buffer to the user's buffer.
Where to Go From Here?
A disclaimer on this column: the careful reader will notice that there
were a lot of assumptions in the explanations given in this
column. For each assumption listed above, there are real-world MPI
implementations with different assumptions.
The number of assumptions provide some insight into why there are so
many different MPI implementations: every system performs
differently. Tweaking to receive the best possible performance is a
compromise between the network behavior, operating system
characteristics, management of finite resources, and behavior of the
MPI application.
Next column, we'll explore the same MPI progression kinds of issues,
but for non-sockets based kinds of networks (e.g., Myrinet,
Inifiband). Such networks typically have lower latency and more
tangible finite resources than TCP networks, so issues such as
determining the short / eager message size become quite important.
Got any MPI questions you want answered? Wondering why one MPI
does this and another does that? Send
them to the MPI Monkey.
Sidebar:
MPI Quiz
|
MPI has message ordering guarantees. But do you know what they are,
exactly? Consider the following pseudocode, with
an MPI_COMM_WORLD containing size processes:
Listing 3:
MPI Quiz
1 if (rank == 0) {
2 for (i = 0; i < size - 1; ++i)
3 MPI_Recv(buffer[i], ..., MPI_ANY_SOURCE, tag,
MPI_COMM_WORLD, MPI_STATUS_IGNORE);
4 } else {
5 MPI_Send(buffer, ..., 0, tag, MPI_COMM_WORLD);
6 }
|
In what order will the messages be received at MPI_COMM_WORLD
rank 0? See the next edition of this column for the answer.
|
Resources
| MPI Forum (MPI-1 and MPI-2 specifications documents) |
http://www.mpi-forum.org/ |
| MPI - The Complete Reference: Volume 1, The MPI Core (2nd ed) (The
MIT Press) |
By Marc Snir, Steve Otto, Steven Huss-Lederman, David Walker, and
Jack Dongarra. ISBN 0-262-69215-5 |
| MPI - The Complete Reference: Volume 2, The MPI Extensions (The
MIT Press) |
By William Gropp, Steven Huss-Lederman, Andrew Lumsdaine, Ewing
Lusk, Bill Nitzberg, William Saphir, and Marc Snir. ISBN
0-262-57123-4. |
| NCSA MPI tutorial |
http://webct.ncsa.uiuc.edu:8900/public/MPI/ |
This article was originally published in ClusterWorld Magazine. It
has been updated and formatted for the web. If you want to read more
about HPC clusters and Linux, you may wish to visit
Linux Magazine.
Jeff Squyres is leading up Cisco's Open MPI efforts as part of
the Server Virtualization Business Unit.
Comment on this article
You must login to leave comments...
Other Visitors Comments
There are no comments currently....
|