MPI: Doing More With Less

Short and Long Messages

Keep in mind that pre-posted envelopes are only one place where registered memory is consumed. Any messages that are not transferred as part of the envelope (i.e., short and long messages) need to operate in registered memory as well.

There are a variety of different flow control schemes to handle such issues. Here's one simplistic example:

  1. Sender: sends envelope to the receiver indicating "I've got a message of X bytes to send to you."
  2. Receiver: upon finding a matching MPI receive, registers the receive buffer (if it wasn't already registered) and sends back an ACK indicating "Ready to receive; send to address Y."
  3. Sender: sends the message to address Y.
  4. Sender: upon completion of the send, replies to the ACK with an envelope indicating "Transfer complete."
  5. Receiver: de-registers the receive buffer (if necessary).
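The five steps above can be sketched as a small state machine. What follows is a single-process simulation in plain C, not real network code: the names (`rndv_xfer`, `rndv_start`, `rndv_progress`) are hypothetical, the "network" is a `memcpy`, and registration is reduced to a flag. The one property worth noticing is that nothing advances unless someone calls `rndv_progress()`.

```c
/* Single-process simulation of the five-step rendezvous handshake.
 * No real network or MPI calls are made; every name is hypothetical. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

enum rndv_state {
    RNDV_ENVELOPE_SENT,  /* step 1 done: receiver knows the message size   */
    RNDV_ACK_SENT,       /* step 2 done: receive buffer registered, ACKed  */
    RNDV_DATA_SENT,      /* step 3 done: payload written to address Y      */
    RNDV_FIN_SENT,       /* step 4 done: "transfer complete" envelope sent */
    RNDV_DONE            /* step 5 done: receive buffer de-registered      */
};

struct rndv_xfer {
    enum rndv_state state;
    const char *send_buf;    /* sender's (already registered) buffer */
    char *recv_buf;          /* "address Y" on the receiver          */
    size_t len;              /* "X bytes" announced in the envelope  */
    bool recv_registered;    /* stand-in for real memory registration */
};

/* Step 1: the sender announces a message of len bytes. */
static void rndv_start(struct rndv_xfer *x, const char *src,
                       char *dst, size_t len)
{
    x->send_buf = src;
    x->recv_buf = dst;
    x->len = len;
    x->recv_registered = false;
    x->state = RNDV_ENVELOPE_SENT;
}

/* Advance the transfer by exactly one step; returns true once it is done. */
static bool rndv_progress(struct rndv_xfer *x)
{
    switch (x->state) {
    case RNDV_ENVELOPE_SENT:   /* step 2: receiver registers and ACKs */
        x->recv_registered = true;
        x->state = RNDV_ACK_SENT;
        break;
    case RNDV_ACK_SENT:        /* step 3: sender writes to address Y */
        assert(x->recv_registered);
        memcpy(x->recv_buf, x->send_buf, x->len);
        x->state = RNDV_DATA_SENT;
        break;
    case RNDV_DATA_SENT:       /* step 4: sender reports completion */
        x->state = RNDV_FIN_SENT;
        break;
    case RNDV_FIN_SENT:        /* step 5: receiver de-registers */
        x->recv_registered = false;
        x->state = RNDV_DONE;
        break;
    case RNDV_DONE:
        break;
    }
    return x->state == RNDV_DONE;
}
```

Calling `rndv_start()` once and then `rndv_progress()` in a loop walks a transfer through steps 2 to 5. A real implementation would juggle many such transfers at once and react to actual network completions instead of advancing unconditionally.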

Registering and de-registering memory is typically an expensive operation, so MPI implementations expend a good deal of effort on registration caches and fast lookup schemes to minimize the time spent managing registered memory.
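To make the caching idea concrete, here is a toy registration cache in plain C. It is a sketch under stated assumptions, not how any particular MPI implementation does it: `fake_register()` and `fake_deregister()` are hypothetical stand-ins for the real (expensive) network-stack calls, the cache has only four slots, and a hit requires an exact match on the buffer's start address. Real caches also have to notice when the application frees or unmaps a cached buffer, which this sketch ignores.

```c
/* Toy registration cache: remembers registered buffers so that repeated
 * transfers from the same buffer skip the costly registration call.
 * Every name here is hypothetical. */
#include <assert.h>
#include <stddef.h>

#define REG_CACHE_SLOTS 4

struct reg_entry {
    const void *addr;
    size_t len;
    unsigned long last_used;   /* logical clock value, used for LRU eviction */
    int valid;
};

struct reg_cache {
    struct reg_entry slot[REG_CACHE_SLOTS];
    unsigned long clock;
    unsigned long registrations;    /* count of "expensive" register calls   */
    unsigned long deregistrations;  /* count of "expensive" deregister calls */
};

/* Stand-ins for the real registration calls; they only count invocations. */
static void fake_register(struct reg_cache *c)   { c->registrations++; }
static void fake_deregister(struct reg_cache *c) { c->deregistrations++; }

static void reg_cache_init(struct reg_cache *c)
{
    size_t i;
    for (i = 0; i < REG_CACHE_SLOTS; ++i)
        c->slot[i].valid = 0;
    c->clock = c->registrations = c->deregistrations = 0;
}

/* Make sure [addr, addr + len) is registered.
 * Returns 1 on a cache hit (no registration needed), 0 on a miss. */
static int reg_cache_acquire(struct reg_cache *c, const void *addr, size_t len)
{
    size_t i, victim = 0;

    assert(addr != NULL && len > 0);
    c->clock++;

    /* Hit: same start address and the cached region is long enough. */
    for (i = 0; i < REG_CACHE_SLOTS; ++i) {
        struct reg_entry *e = &c->slot[i];
        if (e->valid && e->addr == addr && e->len >= len) {
            e->last_used = c->clock;
            return 1;
        }
    }

    /* Miss: take the first empty slot, else the least recently used one. */
    for (i = 0; i < REG_CACHE_SLOTS; ++i) {
        if (!c->slot[i].valid) {
            victim = i;
            break;
        }
        if (c->slot[i].last_used < c->slot[victim].last_used)
            victim = i;
    }
    if (c->slot[victim].valid)
        fake_deregister(c);    /* evict the old registration */
    fake_register(c);
    c->slot[victim].addr = addr;
    c->slot[victim].len = len;
    c->slot[victim].last_used = c->clock;
    c->slot[victim].valid = 1;
    return 0;
}
```

An application that sends from the same buffer over and over pays for one registration and then hits the cache every time. One that cycles through more buffers than there are slots pays for a de-registration and a registration on nearly every transfer, which is exactly the kind of application-dependent behavior that makes tuning these caches hard.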

Progress

The good news is that once a message is given to the communication co-processor, it will progress "in the background" without intervention from the user application (and therefore from the MPI implementation). Since all messages are expected, each one will eventually show up in a buffer, and the network interface will inform the MPI layer (typically when the MPI implementation polls asking for progress). The MPI implementation will then process the buffer, depending on its type and content.

This aspect helps with asynchronous progress of tiny and short messages: the sender fires and forgets, and the receiver finds that the message has already arrived once a matching receive is posted. But this behavior does not necessarily help in rendezvous protocols, because only individual network messages are progressed independently of the main CPU. The flow control messages in the simplistic rendezvous protocol described above are therefore triggered only when the MPI implementation's progress engine runs. In a single-threaded MPI implementation, this usually occurs only when the application enters an MPI library function.

Simply put: single-threaded MPI implementations receive a nice benefit from eagerly-sent messages when communication co-processors are used. They do not necessarily receive the same benefit when rendezvous protocols are used (especially in conjunction with non-blocking MPI communication) because the MPI progress engine still has to poll to effect progress.
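A back-of-the-envelope simulation shows why this matters for applications that overlap computation with a non-blocking rendezvous send. The sketch below is plain C with no MPI calls, and it rests on loudly simplified assumptions: the handshake needs `RNDV_STEPS` progress calls, each entry into the MPI library (think of a call such as `MPI_Test`) advances it by exactly one step, and the function name `chunks_until_complete` is made up for illustration.

```c
/* Toy model: how long does a rendezvous transfer take to finish when the
 * progress engine only runs while the application is inside the MPI library?
 * Assumption: each poll advances the handshake by exactly one step. */
#include <assert.h>

#define RNDV_STEPS 4   /* handshake steps that each need one progress call */

/* Returns the number of compute chunks that have finished at the moment
 * the transfer completes.  poll_interval == 0 means the application never
 * polls during computation, so the transfer is only driven by the final
 * blocking wait after all chunks are done. */
static int chunks_until_complete(int total_chunks, int poll_interval)
{
    int steps_left = RNDV_STEPS;
    int chunk;

    assert(total_chunks >= 0 && poll_interval >= 0);

    for (chunk = 1; chunk <= total_chunks; ++chunk) {
        /* ... one chunk of application computation would happen here ... */

        if (poll_interval > 0 && chunk % poll_interval == 0) {
            /* Stand-in for entering the MPI library to poll for progress. */
            if (--steps_left == 0)
                return chunk;
        }
    }

    /* Not finished during computation: the final wait drives the rest. */
    return total_chunks;
}
```

With 100 chunks of work, polling after every 10th chunk lets the transfer finish after chunk 40 in this model, while never polling defers the whole handshake to the final wait after chunk 100. The numbers are artifacts of the model, but the shape of the result is the point: with a single-threaded progress engine, an occasional poll during long computations is what actually buys the overlap.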

Sidebar: MPI Quiz
Last column, I asked in what order messages would be received at MPI_COMM_WORLD rank 0 from the following code:

Listing 1: MPI Quiz (last column)
if (rank == 0) {
    for (i = 0; i < size - 1; ++i)
        MPI_Recv(buffer[i], ..., MPI_ANY_SOURCE, tag,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
} else {
    MPI_Send(buffer, ..., 0, tag, MPI_COMM_WORLD);
}

Do MPI's message ordering guarantees provide any insight into the receipt order?

No. Since all the messages are coming from different processes, MPI preserves no fine-grained ordering between them (assuming that all messages were sent at roughly the "same" time). The use of MPI_ANY_SOURCE was somewhat of a red herring.

It is a common mistake to assume that this kind of code pattern (especially when using MPI_ANY_SOURCE) will receive messages "in order," meaning that buffer[i] will hold the message sent by MPI_COMM_WORLD rank i + 1 (the senders are ranks 1 through size - 1). While the above code is not a bad technique to avoid bottleneck delays from slow processes, it does not guarantee the source for any given buffer[i]. If guaranteeing the source is necessary, the following may be more appropriate:

Listing 2: MPI Quiz -- a better solution
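if (rank == 0) {
    /* Senders are ranks 1..size-1, so buffer[i] is posted for rank i + 1 */
    for (i = 0; i < size - 1; ++i)
        MPI_Irecv(buffer[i], ..., i + 1, tag,
                  MPI_COMM_WORLD, &reqs[i]);
    /* Block until every posted receive has completed */
    MPI_Waitall(size - 1, reqs, MPI_STATUSES_IGNORE);
} else {
    MPI_Send(buffer, ..., 0, tag, MPI_COMM_WORLD);
}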

Next question...

Will the following code deadlock? Why or why not? What will the communication pattern be? How efficient is it?

Listing 4: MPI Quiz: How Will This Perform?
left = (rank == 0) ? MPI_PROC_NULL : rank - 1;
right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;
MPI_Recv(rbuf, ..., left, tag, comm, &status);
MPI_Send(sbuf, ..., right, tag, comm);

Where to Go From Here?

The same disclaimer from last column applies: the careful reader will notice that there were a lot of assumptions in the explanations given in this column. For each assumption listed above, there are real-world MPI implementations with different assumptions.

The issues described in this column are one set of reasons why MPI implementations are so complex. Management of resources can sometimes be directly at odds with performance; the settings and management algorithms to maximize performance for one application may cause horrendous performance in another. As an MPI implementer, I beg you to remember this the next time you curse your MPI implementation for being slow.

Got any MPI questions you want answered? Wondering why one MPI does this and another does that? Send them to the MPI Monkey.

Resources
MPI Forum (MPI-1 and MPI-2 specifications documents) http://www.mpi-forum.org/
MPI - The Complete Reference: Volume 1, The MPI Core (2nd ed) (The MIT Press) By Marc Snir, Steve Otto, Steven Huss-Lederman, David Walker, and Jack Dongarra. ISBN 0-262-69215-5
MPI - The Complete Reference: Volume 2, The MPI Extensions (The MIT Press) By William Gropp, Steven Huss-Lederman, Andrew Lumsdaine, Ewing Lusk, Bill Nitzberg, William Saphir, and Marc Snir. ISBN 0-262-57123-4.
NCSA MPI tutorial http://webct.ncsa.uiuc.edu:8900/public/MPI/
MPI Community Wiki http://www.mpi-comm-world.org/

This article was originally published in ClusterWorld Magazine. It has been updated and formatted for the web. If you want to read more about HPC clusters and Linux, you may wish to visit Linux Magazine.

Jeff Squyres is leading Cisco's Open MPI efforts as part of the Server Virtualization Business Unit.

 


    Creative Commons License
    ©2005-2012 Copyright Seagrove LLC, Some rights reserved. Except where otherwise noted, this site is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License. The Cluster Monkey Logo and Monkey Character are Trademarks of Seagrove LLC.