2: Serialization
Many users are nervous about using MPI's various modes of non-blocking
communications and instead simply use MPI_SEND
and MPI_RECV. This habit can lead to performance degradation
by unknowingly serializing parallel applications. Processes blocked
in MPI_SEND or MPI_RECV may be wasting valuable CPU
cycles while simply waiting for communication with peer
processes. This situation can even lead to a domino-like effect where
a series of processes are waiting for each other and progress only
occurs in a peer-by-peer fashion - just like the penguins in the
beginning of this article.
This behavior can almost always be fixed in the application. While
some algorithms simply cannot avoid this problem, most can be
refactored to allow a true overlap of computation and
communication. Specifically, let the MPI implementation perform message
passing "in the background" while the user application performs useful
work. A common technique is to use multiple pairs of buffers, swapping
between them on successive iterations. For example, in iteration N,
initiate communication using buffer A and perform useful local work on
buffer B. In iteration N+1, swap the buffers: communicate with buffer
B and work on buffer A. See the pseudocode in Listing 1 for an
example.
Listing 1: Communication and Computation Overlap

buffer_comm = A;
buffer_work = B;
for (...) {
    /* Send the communication buffer */
    MPI_Isend(buffer_comm, ..., &req);

    /* Do useful work on the other buffer */
    do_work(buffer_work);

    /* Finish the communication */
    MPI_Wait(&req, &status);

    /* Swap the buffers */
    buffer_tmp = buffer_comm;
    buffer_comm = buffer_work;
    buffer_work = buffer_tmp;
}
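To get a feel for what the overlap can buy, here is a rough back-of-the-envelope model. It is an idealization, not a measurement: it assumes a fixed communication time and a fixed compute time per iteration, and it assumes the MPI implementation makes full progress on the message in the background. The function names are hypothetical.

```c
#include <assert.h>

/* Idealized cost with blocking calls: each iteration communicates and
 * then computes, so the two costs add up. */
double serialized_time(int iterations, double t_comm, double t_work)
{
    assert(iterations >= 0 && t_comm >= 0.0 && t_work >= 0.0);
    return iterations * (t_comm + t_work);
}

/* Idealized cost with perfect overlap (assumption: the message makes
 * full progress in the background): each iteration costs only the
 * longer of its two phases. */
double overlapped_time(int iterations, double t_comm, double t_work)
{
    assert(iterations >= 0 && t_comm >= 0.0 && t_work >= 0.0);
    double longer = (t_comm > t_work) ? t_comm : t_work;
    return iterations * longer;
}
```

With 100 iterations, 2 ms of communication, and 3 ms of work per iteration, this model gives 500 ms serialized versus 300 ms overlapped. Treat that as an upper bound on the benefit; real gains may be smaller if the implementation cannot progress the message while the application computes.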
And the Number 1, All-Time Favorite Evil to Avoid in Parallel is...
1: Assuming MPI_SEND Will [Not] Block
In a previous edition of this column, I included a sidebar entitled
"To Block or Not To Block" describing typical user confusion as to
whether MPI_SEND is supposed to block or not. It remains a
popular question, frequently asked in multiple forms:
- "My application blocks in MPI_SEND - but only sometimes. Why?"
- "Why does my application work fine with Foo MPI, but deadlock in Bar MPI?"
- "When MPI_SEND returns, has the destination received the message?"
MPI_SEND and MPI_RECV are called "blocking" by the
MPI-1 standard, but they may or may not actually block. Whether or not
an unmatched send will block typically depends on how much buffering
the implementation provides. For example, short messages are usually
sent "eagerly" - regardless of whether a matching receive has been
posted or not. Long messages may be sent with a rendezvous protocol,
meaning that the send will not actually complete until the target has
initiated a matching receive. This behavior is legal because the
semantics of MPI_SEND do not actually define whether the message
has been sent when it returns. The only guarantee that MPI makes is
that the send buffer can safely be re-used when MPI_SEND returns.
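The behavior described above can be pictured with a toy model. This is only a sketch, not the logic of any real MPI implementation: the `eager_limit` parameter and all names below are hypothetical, and real implementations pick their own thresholds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum send_outcome {
    SEND_EAGER,     /* returns at once; message is buffered at the target */
    SEND_MATCHED,   /* a receive is already posted; transfer can proceed  */
    SEND_BLOCKS     /* waits until the target posts a matching receive    */
};

/* Toy model of a blocking send.
 * eager_limit is a hypothetical threshold: unmatched messages up to this
 * size are sent eagerly (0 means the implementation buffers nothing). */
enum send_outcome model_blocking_send(size_t msg_bytes, size_t eager_limit,
                                      bool receive_posted)
{
    if (receive_posted)
        return SEND_MATCHED;
    if (eager_limit > 0 && msg_bytes <= eager_limit)
        return SEND_EAGER;
    return SEND_BLOCKS;
}

/* Human-readable name for an outcome, for logging in the model. */
const char *outcome_name(enum send_outcome outcome)
{
    switch (outcome) {
    case SEND_EAGER:   return "eager";
    case SEND_MATCHED: return "matched";
    case SEND_BLOCKS:  return "blocks";
    }
    assert(!"unknown send_outcome");
    return "unknown";
}
```

In this model the same call can land in any of the three cases depending on the message size and on whether the receiver got there first, which is one way to read the "only sometimes" question above.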
Receives, by their definition, will not return until a matching
message has actually been received. For example, if a matching short
message was previously sent eagerly, it may be received "immediately."
This case is called an "unexpected" message, and MPI
implementations typically provide some level of implicit buffering for
this condition: eagerly-sent, unmatched messages are simply stored in
internal buffering at the target until a matching receive is posted by
the application. A local memory copy is all that is necessary to
complete the receive.
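As a rough illustration of that bookkeeping, the sketch below models an unexpected-message queue at the target. Everything in it is an assumption made for the example: the fixed capacities, the struct layout, and the function names do not come from any particular MPI implementation, and matching is reduced to an exact source and tag comparison.

```c
#include <assert.h>
#include <string.h>

#define MAX_UNEXPECTED  8    /* hypothetical queue capacity        */
#define MAX_EAGER_BYTES 64   /* hypothetical eager message limit   */

struct unexpected_msg {
    int source, tag;
    size_t len;
    char payload[MAX_EAGER_BYTES];
};

struct unexpected_queue {
    struct unexpected_msg msgs[MAX_UNEXPECTED];
    int count;
};

/* An eager message arrived and no matching receive is posted: keep a copy.
 * Returns 0 on success, -1 if the model's buffering is exhausted. */
int store_unexpected(struct unexpected_queue *q, int source, int tag,
                     const void *data, size_t len)
{
    assert(q != NULL && data != NULL);
    if (q->count == MAX_UNEXPECTED || len > MAX_EAGER_BYTES)
        return -1;
    struct unexpected_msg *m = &q->msgs[q->count++];
    m->source = source;
    m->tag = tag;
    m->len = len;
    memcpy(m->payload, data, len);
    return 0;
}

/* The application posts a receive: look for the oldest buffered match.
 * On a hit, one local memory copy completes the receive and the message
 * length is returned. On a miss, returns -1 (a real receive would wait).
 * Assumes the receive buffer is large enough for the matched message. */
long match_receive(struct unexpected_queue *q, int source, int tag,
                   void *buf, size_t capacity)
{
    assert(q != NULL && buf != NULL);
    for (int i = 0; i < q->count; i++) {
        struct unexpected_msg *m = &q->msgs[i];
        if (m->source != source || m->tag != tag)
            continue;
        assert(m->len <= capacity);
        size_t len = m->len;
        memcpy(buf, m->payload, len);
        /* Drop the entry, keeping the rest in arrival order. */
        memmove(&q->msgs[i], &q->msgs[i + 1],
                (size_t)(q->count - i - 1) * sizeof q->msgs[0]);
        q->count--;
        return (long)len;
    }
    return -1;
}
```

Note how the queue can fill up: once `store_unexpected` fails, the model has no buffering left, and a sender in that situation would have to wait for the receiver.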
Note that it is also legal for an MPI implementation to provide zero
buffering - to effectively disallow unexpected messages and
block MPI_SEND until a matching receive is posted (regardless
of the size of the message). MPI applications that assume at least
some level of underlying buffering (i.e., applications that assume
that MPI_SEND will or will not block) are not conformant, and may
run to completion under one MPI implementation but block in
another.
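The classic way to hit this is a head-to-head exchange in which both peers call MPI_SEND first and MPI_RECV second. The toy simulation below models that pattern without making any MPI calls. It is a sketch under stated assumptions: exactly two ranks, one message in each direction, and a hypothetical `buffer_bytes` parameter standing in for whatever buffering the implementation provides.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum op { OP_SEND, OP_RECV };

#define STEPS 2   /* each rank makes one send call and one receive call */

/* Toy model of two ranks exchanging one message each with blocking calls.
 * prog0 and prog1 list each rank's calls in order. buffer_bytes is the
 * hypothetical buffering available per message (0 = none).
 * Returns true if both ranks finish, false if they deadlock. */
bool exchange_completes(const enum op *prog0, const enum op *prog1,
                        size_t msg_bytes, size_t buffer_bytes)
{
    assert(prog0 != NULL && prog1 != NULL);

    const enum op *prog[2] = { prog0, prog1 };
    int pc[2] = { 0, 0 };               /* index of each rank's next call */
    bool inbox[2] = { false, false };   /* message buffered for each rank */
    bool progress = true;

    while (progress) {
        progress = false;
        for (int r = 0; r < 2; r++) {
            int peer = 1 - r;
            if (pc[r] == STEPS)
                continue;               /* rank r has already finished */

            if (prog[r][pc[r]] == OP_RECV) {
                if (inbox[r]) {         /* buffered message: local copy */
                    inbox[r] = false;
                    pc[r]++;
                    progress = true;
                }
                continue;               /* otherwise wait for the peer */
            }

            /* Rank r is in a blocking send. */
            bool peer_receiving = pc[peer] < STEPS &&
                                  prog[peer][pc[peer]] == OP_RECV &&
                                  !inbox[peer];
            if (peer_receiving) {       /* matching receive is posted */
                pc[r]++;
                pc[peer]++;
                progress = true;
            } else if (!inbox[peer] && buffer_bytes > 0 &&
                       msg_bytes <= buffer_bytes) {
                inbox[peer] = true;     /* sent eagerly and buffered */
                pc[r]++;
                progress = true;
            }
        }
    }
    return pc[0] == STEPS && pc[1] == STEPS;
}
```

When both ranks send first, the model completes only if the message fits in the available buffering, which mirrors the "works with Foo MPI, deadlocks with Bar MPI" question. When one rank receives first, it completes even with zero buffering. In a real application the same effect can be had by ordering the calls that way, by using MPI_SENDRECV, or by using non-blocking communication.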
Where to Go From Here?
There you have it - my canonical list of things to avoid while
programming in parallel. Note that even though this is my
favorite list, your mileage may vary - every parallel application is
different. The real moral of the story here is to thoroughly
understand both your application and the run-time environment of the
MPI implementation that you're using. This understanding is the best
way to obtain the best performance.
Next column, we'll launch into the nitty-gritty details of
non-blocking communication. Stay tuned!
Resources
- MPI Forum (MPI-1 and MPI-2 specification documents):
  http://www.mpi-forum.org
- MPI - The Complete Reference: Volume 1, The MPI Core (2nd ed.), by
  Marc Snir, Steve Otto, Steven Huss-Lederman, David Walker, and Jack
  Dongarra. The MIT Press. ISBN 0-262-69215-5.
- MPI - The Complete Reference: Volume 2, The MPI Extensions, by
  William Gropp, Steven Huss-Lederman, Andrew Lumsdaine, Ewing Lusk,
  Bill Nitzberg, William Saphir, and Marc Snir. The MIT Press. ISBN
  0-262-57123-4.
- NCSA MPI tutorial:
  http://webct.ncsa.uiuc.edu:8900/public/MPI/
This article was originally published in ClusterWorld Magazine. It
has been updated and formatted for the web. If you want to read more
about HPC clusters and Linux, you may wish to visit
Linux Magazine.
Jeff Squyres is the Assistant Director for High Performance Computing
for the Open Systems Laboratory at Indiana University and is one
of the lead technical architects of the Open MPI project.