MPI: The Joys of Asynchronous Communication

Some Examples

Let's tie together all this information into a simple example:

Listing 1: Simple non-blocking send/receive example
1  MPI_Request req[2];
2  MPI_Status stat[2];
3  int rank, size, left, right;
4  MPI_Comm_rank(comm, &rank); MPI_Comm_size(comm, &size);
5  left = (rank + size - 1) % size;   /* neighbor ranks, wrapping around */
6  right = (rank + 1) % size;
7  MPI_Irecv(&buffer[1], 1, MPI_INT, right, tag, comm, &req[1]);
8  MPI_Isend(&buffer[0], 1, MPI_INT, left, tag, comm, &req[0]);
9  do_other_work();                   /* must not touch buffer[0] or buffer[1] */
10 MPI_Waitall(2, req, stat);         /* both operations complete here */

In this example, each process sends a single integer to the process on its left and receives a single integer from the process on its right (wrapping around in a torus-like fashion; it's left as an exercise for the reader to figure out the clever left and right calculations). Note that both the send and receive are started in lines 7 and 8, but are not completed until line 10, allowing the application to do other work on line 9.
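For readers who would rather check the neighbor arithmetic than derive it, the expressions on lines 5 and 6 of Listing 1 can be tested in isolation, without MPI. The sketch below wraps them in two helper functions; the names ring_left and ring_right are invented here for illustration and are not MPI functions.

```c
#include <assert.h>

/* Hypothetical helpers (not MPI functions): neighbor ranks on a ring of
   `size` processes, mirroring lines 5 and 6 of Listing 1. */
static int ring_left(int rank, int size)
{
    assert(size > 0 && rank >= 0 && rank < size);
    /* Adding size before subtracting 1 keeps the dividend non-negative;
       in C, (0 - 1) % size evaluates to -1, not size - 1. */
    return (rank + size - 1) % size;
}

static int ring_right(int rank, int size)
{
    assert(size > 0 && rank >= 0 && rank < size);
    /* The last rank wraps around to rank 0. */
    return (rank + 1) % size;
}
```

With four processes, rank 0 gets left neighbor 3 and right neighbor 1, while rank 3 gets left neighbor 2 and right neighbor 0. With a single process, both neighbors are rank 0 itself.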

The MPI standard states that once a buffer has been given to a non-blocking communication function, the application is not allowed to use it until the operation has completed (i.e., until after a successful TEST or WAIT function). Note specifically that the example above sends from and receives into different buffers; since both communications are potentially ongoing simultaneously, it is important to give them different working areas to avoid race conditions.

It is also important to note that an MPI implementation may or may not provide asynchronous progress on message passing operations. Specifically, single-threaded MPI implementations may only be able to make progress on pending messages while inside the MPI library. However, this may not be as bad as it sounds. Some types of networks provide communication co-processors that progress message passing regardless of what the application program is doing (e.g., InfiniBand, Quadrics, Myrinet, and some forms of TCP that have offload engines on the NIC). Although such networks can progress individual messages, periodic entry into the MPI library may still be necessary to complete the MPI implementation's communication protocols.

For example, consider that there is always "other work" to do in an application, and the total time required for this "other work" is always going to be more than what is required for communications. Lines 9-10 in the previous listing only allow one pass through the "other work" followed by waiting for all MPI communication to finish. This may be inefficient because all communication may be suspended during do_other_work() and only resumed during MPI_WAITALL. It may be more efficient to use a logic structure similar to:

Listing 2: Slightly better send/receive example
1  while (have_more_work) {
2      do_some_other_work();
3      if (num_reqs_left > 0) {
4          MPI_Testany(total_num_reqs, reqs, &index, &flag, stats);
5          if (flag == 1) {
6              --num_reqs_left;
7          }
8      }
9  }
10 if (num_reqs_left > 0) {
11     MPI_Waitall(total_num_reqs, reqs, stats);
12 }

The rationale behind this logic is to break up the "other work" into smaller pieces and keep polling MPI for progress on all the outstanding requests. Specifically, line 2 performs a small amount of work, and line 4 then uses MPI_TESTANY to poll MPI and see if any pending requests have completed. This process repeats, giving both the application and the MPI implementation a chance to make progress.

There are actually a lot of factors involved here; every application will be different. For example, if do_some_other_work() relies heavily on data locality, polling through MPI_TESTANY may effectively thrash the L1 and L2 caches. You may need to adjust the granularity of do_some_other_work(), or use one of the other flavors of the TEST operation to achieve better performance.
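One way to make that granularity tunable is to poll only once every few pieces of work rather than after every piece. The following is a minimal sketch of that idea in plain C, with no MPI calls, so that the control flow can be tried on its own. Every name in it (run_with_polling, poll_fn, work_fn, count_work, countdown_poll) is invented for illustration; in a real application the poll callback would wrap a TEST call such as MPI_Testany.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback types (not part of MPI). poll_fn returns nonzero
   once all outstanding communication has completed. */
typedef int (*poll_fn)(void *ctx);
typedef void (*work_fn)(size_t chunk, void *ctx);

/* Run num_chunks pieces of work, polling once every poll_interval chunks
   until the poll callback reports completion. Returns the number of polls. */
static size_t run_with_polling(size_t num_chunks, size_t poll_interval,
                               work_fn work, void *work_ctx,
                               poll_fn poll, void *poll_ctx)
{
    size_t polls = 0;
    int done = 0;

    assert(poll_interval > 0);
    for (size_t i = 0; i < num_chunks; ++i) {
        work(i, work_ctx);
        /* A larger poll_interval means fewer interruptions of the work. */
        if (!done && (i + 1) % poll_interval == 0) {
            done = poll(poll_ctx);
            ++polls;
        }
    }
    return polls;
}

/* Stand-in callbacks for experimentation: count the chunks processed, and
   pretend that communication completes after a fixed number of polls. */
static void count_work(size_t chunk, void *ctx)
{
    (void)chunk;
    ++*(size_t *)ctx;
}

static int countdown_poll(void *ctx)
{
    int *polls_left = ctx;
    if (*polls_left > 0) {
        --*polls_left;
    }
    return *polls_left == 0;
}
```

Raising poll_interval reduces how often the work is interrupted, at the cost of giving the MPI implementation fewer chances to make progress. Which setting is best has to be measured for each application. As in Listing 2, a final WAIT call is still needed after the loop if the poll never reported completion.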

The moral of the story is to check whether your MPI implementation provides true asynchronous progress. If it does not, then some form of periodic polling through a TEST operation may be required to achieve optimal performance. If asynchronous progress is supported, then additional polling may not be required. But unfortunately, there's no silver bullet: if you're looking for true communication and computation overlap in your application, you may need to tune its behavior with each different MPI implementation. Experimentation is key; your mileage may vary.
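When experimenting, it helps to reduce each run to a single number. The sketch below is one possible metric, not something the MPI standard defines: it estimates how much of the communication time was hidden behind computation. It assumes you have separately timed the computation alone, the communication alone, and the combined run (for example with MPI_Wtime). The function name overlap_fraction is invented here.

```c
#include <assert.h>

/* Hypothetical helper (not part of MPI): estimate the fraction of
   communication time that was hidden behind computation.
     t_compute: seconds for the computation alone
     t_comm:    seconds for the communication alone
     t_both:    seconds for the run that tries to overlap the two */
static double overlap_fraction(double t_compute, double t_comm, double t_both)
{
    assert(t_compute >= 0.0 && t_comm >= 0.0 && t_both >= 0.0);
    if (t_comm <= 0.0) {
        return 0.0;  /* no communication, so nothing to hide */
    }
    /* Time saved relative to running the two phases back to back. */
    double hidden = t_compute + t_comm - t_both;
    double fraction = hidden / t_comm;
    /* Clamp: timing noise can push the estimate outside [0, 1]. */
    if (fraction < 0.0) {
        fraction = 0.0;
    }
    if (fraction > 1.0) {
        fraction = 1.0;
    }
    return fraction;
}
```

A result near 1.0 suggests the communication was almost entirely overlapped; a result near 0.0 suggests it was effectively serialized with the computation. Comparing this value with and without periodic TEST calls gives a rough indication of whether a given MPI implementation needs the polling.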

If you have a multi-threaded MPI implementation that does not support asynchronous progress, it may be more efficient to have a second thread block in MPI_WAITALL and let the primary thread do its computational work. L1 and L2 caching effects (among other things) will still affect the overall performance, but potentially in a different way. Threaded MPI implementations are a sufficiently complex topic that they will be discussed in a future column.

Where to Go From Here?

Non-blocking communications, when used properly, can provide a tremendous performance boost to parallel applications by allowing the MPI implementation to make at least some asynchronous progress (particularly when used with communication co-processor-based networks). Next column, we'll continue with a more in-depth look at non-blocking communications, including persistent mode sending and more examples of typical non-blocking communication programming models.

Resources
MPI Forum (MPI-1 and MPI-2 specifications documents) http://www.mpi-forum.org
MPI - The Complete Reference: Volume 1, The MPI Core (2nd ed) (The MIT Press) By Marc Snir, Steve Otto, Steven Huss-Lederman, David Walker, and Jack Dongarra. ISBN 0-262-69215-5
MPI - The Complete Reference: Volume 2, The MPI Extensions (The MIT Press) By William Gropp, Steven Huss-Lederman, Andrew Lumsdaine, Ewing Lusk, Bill Nitzberg, William Saphir, and Marc Snir. ISBN 0-262-57123-4.
NCSA MPI tutorial http://webct.ncsa.uiuc.edu:8900/public/MPI/

This article was originally published in ClusterWorld Magazine. It has been updated and formatted for the web. If you want to read more about HPC clusters and Linux, you may wish to visit Linux Magazine.

Jeff Squyres is the Assistant Director for High Performance Computing for the Open Systems Laboratory at Indiana University and is one of the lead technical architects of the Open MPI project.

    Creative Commons License
    ©2005-2012 Copyright Seagrove LLC, Some rights reserved. Except where otherwise noted, this site is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License. The Cluster Monkey Logo and Monkey Character are Trademarks of Seagrove LLC.