In Parallel, Everyone Hears You Scream (part deux)
A guy walks into a breakfast joint with 16 penguins. They sit down at
the biggest table in the place. The guy orders coffee for himself and
a bowl of cereal for each of the penguins. He then breaks out a
newspaper and casually starts reading. Meanwhile, the penguins'
breakfasts arrive and the first penguin starts eating while all the
others look at him. After he finishes, all the penguins look at the
second penguin while he eats. When the last penguin finishes his
cereal, he emits a loud "gwank!" and all the penguins get up and file
out of the restaurant.
The guy looks up, folds up his newspaper, and gets up to pay the
bill. One of the other patrons, who had been watching the spectacle, said,
"Excuse me sir, I have to ask. What was that all about? Why did you
just sit there while your penguins ate their breakfast?"
"Yeah, it always takes this long," he said. "It's cerealized."
The Story So Far
Last column, we started my Top-10, All-Time Favorite Evils to Avoid in
Parallel. As promised, it's so big that it takes two months
to cover. We covered the first five last month:
- Inconsistent environment / "dot" files
- Orphaning MPI requests
- MPI_PROBE
- Mixing Fortran (and C++) compilers
- Blaming MPI for programmer errors
So without further ado, from the home office in Bloomington, IN, let's
continue with number 5...
5: Re-Using a Buffer Prematurely
Recall that MPI's message passing behavior is mostly defined through
buffer semantics. Specifically, the MPI standard makes it clear that a
buffer can only be used in one communication at a time. It is
erroneous to use the same buffer in multiple, ongoing communications.
A common error is for MPI programs to start a non-blocking
communication to or from a buffer and then start a second one with the
same buffer before the first completes. There are two common cases:
concurrent reading and writing, and multiple concurrent reads.
Simultaneously reading from and writing to the same buffer is clearly a race
condition. For example, if both a non-blocking send and a non-blocking
receive are simultaneously posted to the same buffer, there is no
guarantee of the order in which they will complete. Indeed, it may be
impossible to know exactly what is sent because it will depend on
exactly when the incoming message was received vs. when the outgoing
message was actually able to be sent.
Multiple concurrent readers (i.e., multiple sends from the same
buffer) are frequently seen as harmless; surely multiple readers can't
cause a problem for the MPI implementation, can they?
Probably not.
But MPI still says that it's illegal, and it may cause
problems - even if the sends all complete normally. The rationale here
is that the MPI implementation may do something with "special" memory
in order to maximize performance. For example, networks based on
OS-bypass mechanisms may require the use of "pinned" memory - memory
that the operating system is disallowed from swapping out. This
requirement allows the OS-bypass-capable NIC to find the memory and be
guaranteed that it doesn't move while the network transfer takes
place.
An MPI implementation typically has to maintain some kind of state to
keep track of pinned memory. While such techniques usually involve
reference counting - the memory is not "un-pinned" by the MPI
implementation until all communications involving it have completed -
it is conceivable that an MPI implementation will skip reference
counting and other error checking in order to decrease overhead (and
therefore latency). Skipping those checks can
result in the premature "un-pinning" of memory while other
communications are still ongoing, leading to run-time errors or other
unpredictable behavior.
4: Mixing MPI Implementations
It is not uncommon for someone to ask me a question about LA-MPI,
FT-MPI, MPICH, or one of several other MPI implementations. I always
politely reply that I work on LAM/MPI, and can't really answer
questions about those implementations. This situation is typically
more amusing to me than anything else, but it underscores the issue
that many users frequently do not distinguish between different MPI
implementations.
This misconception unfortunately spills over to the technology as
well; users compile their application with one implementation, try to
run it with another, and are confused when it does not work. Or,
worse, they run their application on multiple machines, each with a
different implementation installed (this is similar to but slightly
different from point #10: inconsistent environment / "dot"
files). This situation is almost guaranteed not to work.
Additionally, some users assume that the mpi.h
and mpif.h header files are interchangeable between MPI
implementations (or do not make the distinction). They are not;
indeed, the differences in these files are among the top-level reasons
that MPI implementations are incompatible with each other (e.g.,
types, constants, and macros will likely have conflicting values in
different implementations). Even worse, an MPI application
may compile properly with the wrong mpi.h file, but
then fail at run time in strange and mysterious ways.
The most common way to avoid this problem is to use the MPI
implementation's "wrapper" compilers for compiling and linking
applications. Most (but not all) MPI implementations provide commands
such as mpicc and mpif77 to compile C and Fortran 77
programs, respectively. These commands do nothing other than add
relevant command line arguments before invoking an underlying
compiler. They are typically the easiest way to ensure that the
"right" mpi.h and MPI library are used when compiling and
linking.
3: MPI_ANY_SOURCE
The use of MPI_ANY_SOURCE is convenient for programmers; it
is not uncommon for messages with the same signature to
arrive from multiple sources. However, depending on the underlying
network and the MPI implementation, this may force extra overhead upon
receipt of the message. For example, the MPI implementation may be
required to associate the receive request with all possible
communication devices (which may entail spinning while polling all
devices). When a matching message arrives, the MPI implementation must
disassociate the request from all other devices. Not only does this
cause extra latency simply by necessitating N actions, it may involve
costly locking and unlocking mechanisms in multi-threaded programs.
When possible, try to avoid using MPI_ANY_SOURCE. Instead,
it may be better to post N non-blocking receives - one for each source
from which the message may be received. This arrangement allows the
MPI implementation to check only the relevant communication devices. Functions such
as MPI_WAITANY and MPI_TESTANY can be used to
determine when a message arrives. This situation is, of course, a
trade-off - if you have a message that legitimately may arrive
from any peer process, then MPI_ANY_SOURCE may
actually be more efficient than posting N receives. Other factors also
become relevant, such as the frequency of messages from each peer
(including strategies to avoid unexpected messages) - it depends on
the application.