Some Examples
Let's tie together all this information into a simple example:
Listing 1:
Simple non-blocking send/receive example
1 MPI_Request req[2];
2 MPI_Status stat[2];
3 int rank, size, left, right;
4 MPI_Comm_rank(comm, &rank); MPI_Comm_size(comm, &size);
5 left = (rank + size - 1) % size;
6 right = (rank + 1) % size;
7 MPI_Irecv(&buffer[1], 1, MPI_INT, right, tag, comm, &req[1]);
8 MPI_Isend(&buffer[0], 1, MPI_INT, left, tag, comm, &req[0]);
9 do_other_work();
10 MPI_Waitall(2, req, stat);
In this example, each process sends a single integer to the process on
its left and receives a single integer from the process on its right
(wrapping around in a torus-like fashion; it's left as an exercise for
the reader to figure out the clever left and right
calculations). Note that both the receive and the send
are started in lines 7 and 8, but are not completed
until line 10, allowing the application to do other work on line 9.
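For readers who want to check their answer to that exercise, here is a minimal sketch in plain C (no MPI required). The helper names ring_left, ring_right, and print_ring are hypothetical and exist only for this illustration; the arithmetic is the same as lines 5 and 6 of Listing 1. Adding size before subtracting 1 keeps the left operand of % non-negative, so rank 0 wraps around to size - 1.

```c
#include <stdio.h>

/* Hypothetical helpers mirroring lines 5 and 6 of Listing 1. */

/* Left neighbor: adding size first avoids a negative operand to %,
 * so rank 0 wraps around to size - 1. */
int ring_left(int rank, int size)
{
    return (rank + size - 1) % size;
}

/* Right neighbor: the highest rank wraps around to 0. */
int ring_right(int rank, int size)
{
    return (rank + 1) % size;
}

/* Print every rank's neighbors for a ring of the given size. */
void print_ring(int size)
{
    for (int rank = 0; rank < size; ++rank) {
        printf("rank %d: left = %d, right = %d\n",
               rank, ring_left(rank, size), ring_right(rank, size));
    }
}
```

Calling print_ring(4) from a small main() is a quick way to see the wraparound at both ends of the ring.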
The MPI standard states that once a buffer has been given to a
non-blocking communication function, the application is not allowed to
use it until the operation has completed (i.e., until after a
successful TEST or WAIT function). Note specifically
that the example above sends and receives to different
buffers; since both communications are potentially ongoing
simultaneously, it is important to give them different working areas to
avoid race conditions.
It is also important to note that an MPI implementation may or may
not provide asynchronous progress on message passing
operations. Specifically, single-threaded MPI implementations may only
be able to make progress on pending messages while inside the MPI
library. However, this may not be as bad as it sounds. Some types of
networks provide communication co-processors that progress message
passing regardless of what the application program is doing (e.g.,
InfiniBand, Quadrics, Myrinet, and some forms of TCP that have offload
engines on the NIC). Although such networks can progress individual
messages, periodic entry into the MPI library may still be necessary
to complete the MPI implementation's communication protocols.
For example, consider that there is always "other work" to do in an
application, and the total time required for this "other work" is
always going to be more than what is required for
communications. Lines 9-10 in the previous listing only allow one pass
through the "other work" followed by waiting for all MPI communication
to finish. This may be inefficient because all communication may be
suspended during do_other_work() and only resumed
during MPI_WAITALL. It may be more efficient to use a logic
structure similar to:
Listing 2:
Slightly better send/receive example
1 while (have_more_work) {
2   do_some_other_work();
3   if (num_reqs_left > 0) {
4     MPI_Testany(total_num_reqs, reqs, &index, &flag, stats);
5     if (flag == 1) {
6       --num_reqs_left;
7     }
8   }
9 }
10 if (num_reqs_left > 0) {
11   MPI_Waitall(total_num_reqs, reqs, stats);
12 }
The rationale behind this logic is to break up the "other work" into
smaller pieces and keep polling MPI for progress on all the
outstanding requests. Specifically, line 2 performs a small
amount of work, and line 4 then uses MPI_TESTANY to poll MPI and see
if any pending requests have completed. This process repeats, giving
both the application and the MPI implementation a chance to make
progress.
There are actually a lot of factors involved here; every application
will be different. For example, if do_some_other_work() relies
heavily on data locality, polling through MPI_TESTANY may
effectively thrash the L1 and L2 caches. You may need to adjust the
granularity of do_some_other_work(), or use one of the other
flavors of the TEST operation to achieve better performance.
The moral of the story is to check and see if your MPI implementation
provides true asynchronous progress or not. If it does not, then some
form of periodic poll through a TEST operation may be
required to achieve optimal performance. If asynchronous progress is
supported, then additional polling may not be required. But
unfortunately, there's no silver bullet: if you're looking for true
communication and computation overlap in your application, you may
need to tune its behavior with each different MPI
implementation. Experimentation is key - your mileage may vary.
If you have a multi-threaded MPI implementation that does not support
asynchronous progress, it may be more efficient to have a second
thread block in MPI_WAITALL and let the primary thread do its
computational work. L1 and L2 caching effects (among other things)
will still affect the overall performance, but potentially in a
different way. Threaded MPI implementations are a sufficiently complex
topic that they will be discussed in a future column.
Where to Go From Here?
Non-blocking communications, when used properly, can provide a
tremendous performance boost to parallel applications by allowing the
MPI implementation to make at least some form of asynchronous progress
(particularly when used with communication co-processor-based
networks). Next column, we'll continue with a more in-depth look at
non-blocking communications, including persistent mode sending and
more examples of typical non-blocking communication programming
models.
Resources
MPI Forum (MPI-1 and MPI-2 specification documents)
http://www.mpi-forum.org
MPI - The Complete Reference: Volume 1, The MPI Core (2nd ed) (The MIT Press)
By Marc Snir, Steve Otto, Steven Huss-Lederman, David Walker, and
Jack Dongarra. ISBN 0-262-69215-5.
MPI - The Complete Reference: Volume 2, The MPI Extensions (The MIT Press)
By William Gropp, Steven Huss-Lederman, Andrew Lumsdaine, Ewing
Lusk, Bill Nitzberg, William Saphir, and Marc Snir. ISBN
0-262-57123-4.
NCSA MPI tutorial
http://webct.ncsa.uiuc.edu:8900/public/MPI/
This article was originally published in ClusterWorld Magazine. It
has been updated and formatted for the web. If you want to read more
about HPC clusters and Linux, you may wish to visit
Linux Magazine.
Jeff Squyres is the Assistant Director for High Performance Computing
for the Open Systems Laboratory at Indiana University and is one
of the lead technical architects of the Open MPI project.