Short and Long Messages
Keep in mind that pre-posted envelopes are only one place where
registered memory is consumed. Any messages that are not transferred
as part of the envelope (i.e., short and long messages) need to
operate in registered memory as well.
There are a variety of different flow control schemes to handle such
issues. Here's one simplistic example:
- Sender: sends envelope to the receiver indicating "I've got a
message of X bytes to send to you."
- Receiver: upon finding a matching MPI receive, registers the
receive buffer (if it wasn't already registered) and sends back an ACK
indicating "Ready to receive; send to address Y."
- Sender: sends the message to address Y.
- Sender: upon completion of the send, replies to the ACK with an
envelope indicating "Transfer complete."
- Receiver: de-registers the receive buffer (if necessary).
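The exchange above can be sketched as a small state machine. What follows is a plain-C simulation under stated assumptions, not code from any MPI implementation: the message kinds (RTS for the initial envelope, CTS for the ACK, FIN for "Transfer complete"), the struct names, and the rendezvous_transfer helper are all hypothetical, and registration is modeled as nothing more than a flag.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical control-message kinds for the simplistic protocol. */
enum msg_kind { MSG_RTS, MSG_CTS, MSG_FIN };

struct envelope {
    enum msg_kind kind;
    size_t nbytes;   /* RTS: number of bytes the sender wants to transfer */
    void *addr;      /* CTS: address the sender should write to */
};

struct receiver {
    char *buf;        /* the user's receive buffer */
    size_t capacity;
    int registered;   /* models "is this buffer registered?" */
    int complete;
};

/* Receiver sees the RTS: register the buffer and answer with a CTS. */
static int recv_handle_rts(struct receiver *r, const struct envelope *rts,
                           struct envelope *cts)
{
    if (rts->kind != MSG_RTS || rts->nbytes > r->capacity)
        return -1;
    r->registered = 1;   /* stand-in for an expensive registration call */
    cts->kind = MSG_CTS;
    cts->nbytes = rts->nbytes;
    cts->addr = r->buf;
    return 0;
}

/* Sender sees the CTS: move the payload, then announce completion. */
static void send_handle_cts(const struct envelope *cts, const char *payload,
                            struct envelope *fin)
{
    assert(cts->kind == MSG_CTS);
    memcpy(cts->addr, payload, cts->nbytes);   /* models the bulk transfer */
    fin->kind = MSG_FIN;
    fin->nbytes = cts->nbytes;
    fin->addr = NULL;
}

/* Receiver sees the FIN: de-register and mark the receive complete. */
static void recv_handle_fin(struct receiver *r, const struct envelope *fin)
{
    assert(fin->kind == MSG_FIN);
    r->registered = 0;
    r->complete = 1;
}

/* Drive the whole exchange; returns the number of control messages used,
 * or -1 if the receive buffer is too small. */
static int rendezvous_transfer(struct receiver *r, const char *payload,
                               size_t n)
{
    struct envelope rts = { MSG_RTS, n, NULL }, cts, fin;
    if (recv_handle_rts(r, &rts, &cts) != 0)
        return -1;
    send_handle_cts(&cts, payload, &fin);
    recv_handle_fin(r, &fin);
    return 3;   /* RTS + CTS + FIN, in addition to the payload itself */
}
```

Calling rendezvous_transfer(&r, "hello", 6) on a receiver with a 16-byte buffer copies the string and costs three control messages on top of the payload. That overhead is why this kind of handshake is normally reserved for messages too large to send eagerly.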
Registering and de-registering memory are typically expensive
operations, so MPI implementations expend a good deal of effort
optimizing caching and fast lookup systems in an attempt to minimize
time spent managing registered memory.
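To make the caching idea concrete, here is a minimal sketch of a registration cache with least-recently-used eviction. It is an illustration only: the names (reg_cache, reg_cache_acquire, CACHE_SLOTS) are hypothetical, the "registration" is just a counter increment, and real implementations use far more elaborate lookup structures and must also cope with memory being freed behind their back.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_SLOTS 4   /* hypothetical, deliberately tiny */

struct reg_entry {
    const void *addr;
    size_t len;
    unsigned long last_used;   /* 0 means "never used" (empty slot) */
    int valid;
};

struct reg_cache {
    struct reg_entry slot[CACHE_SLOTS];
    unsigned long clock;           /* logical time, bumped per lookup */
    unsigned long registrations;   /* count of "expensive" operations */
};

/* Look up a region; register it on a miss.
 * Returns 1 on a cache hit, 0 if the region had to be registered.
 * The cache must start out zero-initialized. */
static int reg_cache_acquire(struct reg_cache *c, const void *addr, size_t len)
{
    int victim = 0;

    assert(c != NULL && addr != NULL);
    c->clock++;
    for (int i = 0; i < CACHE_SLOTS; i++) {
        struct reg_entry *e = &c->slot[i];
        if (e->valid && e->addr == addr && e->len >= len) {
            e->last_used = c->clock;   /* hit: reuse the registration */
            return 1;
        }
        /* Remember the emptiest or least recently used slot so far. */
        if (e->last_used < c->slot[victim].last_used)
            victim = i;
    }

    /* Miss: evict the victim (a real cache would de-register it here)
     * and pay for a new registration. */
    c->slot[victim].addr = addr;
    c->slot[victim].len = len;
    c->slot[victim].last_used = c->clock;
    c->slot[victim].valid = 1;
    c->registrations++;
    return 0;
}
```

With a cache like this, an application that repeatedly sends from the same few buffers pays the registration cost once per buffer rather than once per message.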
Progress
The good news is that once a message is given to the communication
co-processor, it will progress "in the background" without
intervention from the user application (and therefore from the MPI
implementation). Since all messages are expected, each one will eventually
show up in a buffer, and the network interface will inform the MPI
layer (typically when the MPI implementation polls asking for
progress). The MPI implementation will then process the buffer,
depending on its type and content.
This aspect helps with asynchronous progress of tiny and short
messages. The sender fires and forgets; the receiver will find the
message has already arrived once a matching receive is posted. But
this behavior does not necessarily help in rendezvous protocols: only
single network messages are progressed independently of the main CPU. So
the flow control messages described in the simplistic rendezvous
protocol (above) are only triggered when the MPI implementation's
progress engine is run. In a single-threaded MPI implementation, this
usually only occurs when the application enters an MPI library
function.
Simply put: single-threaded MPI implementations receive a nice benefit
from eagerly-sent messages when communication co-processors are
used. They do not necessarily receive the same benefit when rendezvous
protocols are used (especially in conjunction with non-blocking MPI
communication) because the MPI progress engine still has to poll to
effect progress.
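The difference can be illustrated with a toy model in plain C. This is a sketch under heavy assumptions, not a model of any particular MPI implementation: every name here (transfer, start_send, progress, compute_then_wait) is hypothetical, an eager payload is assumed to be delivered by the co-processor with no further software involvement, and a rendezvous payload is assumed to need two software reactions first (the receiver answering the initial envelope, then the sender answering the ACK).

```c
#include <assert.h>

enum proto { EAGER, RENDEZVOUS };

struct transfer {
    int sw_steps_left;       /* software reactions needed before payload moves */
    int payload_delivered;
    int polls;               /* how many times the progress engine ran */
};

/* Hand the first network message to the (simulated) co-processor. */
static void start_send(struct transfer *t, enum proto p)
{
    t->polls = 0;
    t->sw_steps_left = (p == EAGER) ? 0 : 2;
    t->payload_delivered = (p == EAGER);   /* eager: fire and forget */
}

/* One pass of the progress engine, i.e., one entry into the MPI library.
 * Each pass can react to at most one pending flow control message. */
static void progress(struct transfer *t)
{
    t->polls++;
    if (!t->payload_delivered && --t->sw_steps_left == 0)
        t->payload_delivered = 1;
}

/* Compute for work_units, entering the library every poll_every units
 * (0 means never), then block until the payload is delivered.
 * Returns 1 if delivery overlapped with the computation, 0 otherwise. */
static int compute_then_wait(struct transfer *t, int work_units, int poll_every)
{
    assert(work_units >= 0 && poll_every >= 0);
    for (int u = 1; u <= work_units; u++) {
        /* ... one unit of application work would happen here ... */
        if (poll_every > 0 && u % poll_every == 0)
            progress(t);
    }
    int overlapped = t->payload_delivered;
    while (!t->payload_delivered)   /* models a blocking wait */
        progress(t);
    return overlapped;
}
```

In this model an eager send overlaps with computation even if the application never polls, while a rendezvous send overlaps only if the application enters the library at least twice during the computation. In a real non-blocking MPI code, the rough analogue is calling a function such as MPI_Test periodically while a large transfer is outstanding.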
Sidebar: MPI Quiz
Last column, I asked in what order messages would be received
at MPI_COMM_WORLD rank 0 from the following code:
Listing 1: MPI Quiz (last column)

    if (rank == 0) {
        for (i = 0; i < size - 1; ++i)
            MPI_Recv(buffer[i], ..., MPI_ANY_SOURCE, tag,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    } else {
        MPI_Send(buffer, ..., 0, tag, MPI_COMM_WORLD);
    }

Do MPI's message ordering guarantees provide any insight into the
receipt order?
No. Since all the messages are coming from different processes, MPI
preserves no fine-grained ordering between them (assuming that all
messages were sent at roughly the "same" time). The use of
MPI_ANY_SOURCE was somewhat of a red herring.
It is a common mistake to assume that this kind of code pattern
(especially when using MPI_ANY_SOURCE) will receive messages
"in order," meaning that buffer[i] will correspond to the
message sent by MPI_COMM_WORLD rank i + 1 (rank 0 only receives).
While the above code is not a bad technique to avoid bottleneck delays
from slow processes, it does not guarantee the source for any
given buffer[i]. If guaranteeing the source is necessary, the
following may be more appropriate:
Listing 2: MPI Quiz -- a better solution

    if (rank == 0) {
        /* Senders are ranks 1..size-1, so buffer[i] is tied to rank i + 1 */
        for (i = 0; i < size - 1; ++i)
            MPI_Irecv(buffer[i], ..., i + 1, tag,
                      MPI_COMM_WORLD, &reqs[i]);
        MPI_Waitall(size - 1, reqs, MPI_STATUSES_IGNORE);
    } else {
        MPI_Send(buffer, ..., 0, tag, MPI_COMM_WORLD);
    }

Next question...
Will the following code deadlock? Why or why not? What will the
communication pattern be? How efficient is it?
Listing 4: MPI Quiz: How Will This Perform?

    left = (rank == 0) ? MPI_PROC_NULL : rank - 1;
    right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;
    MPI_Recv(rbuf, ..., left, tag, comm, &status);
    MPI_Send(sbuf, ..., right, tag, comm);

Where to Go From Here?
The same disclaimer from last column applies: the careful reader will
notice that there were a lot of assumptions in the
explanations given in this column. For each assumption listed above,
there are real-world MPI implementations with different assumptions.
The issues described in this column are one set of reasons why MPI
implementations are so complex. Management of resources can sometimes
be directly at odds with performance; the settings and management
algorithms to maximize performance for one application may cause
horrendous performance in another. As an MPI implementer, I beg you to
remember this the next time you curse your MPI implementation for
being slow.
Got any MPI questions you want answered? Wondering why one MPI
does this and another does that? Send
them to the MPI Monkey.
Resources
- MPI Forum (MPI-1 and MPI-2 specifications documents): http://www.mpi-forum.org/
- MPI - The Complete Reference: Volume 1, The MPI Core (2nd ed) (The MIT Press). By Marc Snir, Steve Otto, Steven Huss-Lederman, David Walker, and Jack Dongarra. ISBN 0-262-69215-5.
- MPI - The Complete Reference: Volume 2, The MPI Extensions (The MIT Press). By William Gropp, Steven Huss-Lederman, Andrew Lumsdaine, Ewing Lusk, Bill Nitzberg, William Saphir, and Marc Snir. ISBN 0-262-57123-4.
- NCSA MPI tutorial: http://webct.ncsa.uiuc.edu:8900/public/MPI/
- MPI Community Wiki: http://www.mpi-comm-world.org/
This article was originally published in ClusterWorld Magazine. It
has been updated and formatted for the web. If you want to read more
about HPC clusters and Linux, you may wish to visit
Linux Magazine.
Jeff Squyres is leading up Cisco's Open MPI efforts as part of
the Server Virtualization Business Unit.