|
Page 1 of 2 Progress is good, so they say
My mother called me the other day, tremendously concerned. "Are you
going to get sued?" she demanded. "Are you going to have to pay a
fine, or go to jail?"
"Err... Mom, I have no idea what you're talking about."
"That 'work' you do. I heard on the radio today that the recording
industry is suing hundreds of people for illegally downloading music
on the Internet."
After a long conversation detailing the difference between MP3 files
and MPI, my mom is still convinced that I'm going to jail.
(my [real] mother insists that I provide the following disclaimer: the
above story is entirely fictional; she's mad as hell and isn't going
to take it anymore)
The Story So Far
Last column, we talked about some of the fundamental architectural
decisions that an MPI implementation has to make during its design
phases. This column, we'll take a closer look at some point-to-point
message passing issues and implementation details. The performance on
various platforms and networks can vary wildly, not just because of
the quality of the MPI implementation (although this is a frequently
overlooked issue) but because of how progress is made in the
underlying MPI engine, which is frequently defined in terms of how the
underlying network(s) allow messages to be sent and received.
The inner workings of how an MPI implementation performs the seemingly
simple task of moving bytes from one process to another may surprise
you. More importantly, having at least a basic understanding of what
your MPI is doing may enable you to write your application to match
its underlying semantics, and therefore squeeze just a little bit more
performance out of your environment.
Sidebar:
Common MPI envelope data
MPI implementations typically have to include many of the following
common data when messages are sent across a network. Sometimes the data
must be explicitly included in the message, sometimes the network
transport itself directly or indirectly implies several of the values:
- Envelope type: If the MPI implementation uses
different types (and sizes) of envelopes for different situations, an
identifier must be included to indicate what kind of envelope is being
used.
- Communicator ID: Almost always required (may not
be required if the communicator can be otherwise identified, such as
having a different TCP socket for each communicator, but this may be
tremendously inefficient in terms of resources).
- Tag: Almost always required (ditto with
communicator ID).
- Originator: Some networks can naturally provide
from whom an envelope was received (e.g., TCP), but others cannot
(e.g., gm/Myrinet). If the network cannot provide the information, the
originator of the envelope must be included in the envelope.
- Message length: Even if the network or operating
system can provide this, it is frequently more efficient if the MPI
implementation knows how many bytes to receive in the payload.
- Flags: Typically a bit field, indicates
information such as: whether the envelope contains an ACK, whether the
message requires an ACK, etc.
- Sequence number: Some MPI implementation may
send messages out of order (for efficiency, or if messages may take
multiple paths to the destination); a sequence number may be required
to enforce MPI's message ordering guarantees.
- Peer request pointer: For messages that require
multiple sends (rendezvous protocols, synchronous messages, etc.),
including a pointer to the source request can avoid searching in
memory for the matching message. For example, if a sender includes a
pointer to its MPI request entry in the envelope, and that address is
echoed back in the ACK from the receiver, the sender knows exactly
which request has been acknowledged.
|
Message Envelopes
Keep in mind that the body of the message is not the only content that
needs to be sent across the network. A prologue - frequently called
an envelope - must also be sent so that the receiver knows
what the message is. For example, the envelope usually contains
identifiers for the message's communicator and tag. It usually
contains a small amount of other information as well, but this varies
from implementation to implementation (and potentially even between
network types in the same MPI implementation). Regardless, the overall
envelope size is kept fairly small; enlarging it will only increase
the latency of small messages - consider that sending a 1-byte user
message must also send a corresponding envelope.
Some MPI implementations send different types (and sizes) of envelopes
depending on the message; Sidebar
1 shows a list of common elements that may be included in
envelopes. An implementation's goal is typically to send the shortest
envelope possible to reduce the latency of "simple" messages, such as
short, non-synchronous messages.
MPI Progress
"Progress" is typically defined as the advancement of pending
communications - completing sends and/or receives that were initiated
previously. This includes both blocking and non-blocking
communications. For simplicity, we'll focus on single-threaded MPI
implementations here; the issues facing multi-threaded implementations
are best described as "analogous" - they're similar in spirit, but the
deep, dark details are different.
What exactly is an MPI implementation doing while it is
"making progress"? How do the bytes get from process A to process B?
The answer is inevitably different in each MPI implementation, but
let's look at a few common examples to see what is happening under the
covers in a typical TCP sockets-based implementation (next column,
we'll delve into non-sockets based implementations). These examples
are likely not exactly how any one MPI implementation works,
but are influenced by several real-world MPI progress engines.
Sockets
Most sockets-based implementations open at most one socket for
point-to-point MPI messages between pairs of processes; all traffic is
multiplexed across the socket.
If no other communications are pending, blocking sends are
[deceptively] easy - the writev(2) system call can send both
the envelope and payload in a single function call (and potentially in
a single TCP packet). The operating system may or may not block... but
who cares? MPI is allowed to block, so that point is irrelevant. Also
recall that when write() and writev() return, we
have no guarantees from the operating system that the message has
actually been received (or even sent!). All we know is that the buffer
is available for re-use, and therefore MPI can return from the
blocking send.
Similarly, blocking receives can be implemented with calls
to read(2). Using eager or rendezvous protocols, the
algorithm is simple because the MPI is allowed to block
in writev(2) and read(2). The following simplistic
pseudocode outlines both cases:
Listing 1:
Blocking sockets code
1 if (size <= eager_size) {
2 fill_envelope(env,iov);
3 writev(fd,iov,2);
4 } else {
5 fill_envelope(env,iov);
6 write(fd,env,sizeof(env));
7 wait_for_ack(fd);
8 writev(fd,iov,2);
9 }
|
A real MPI implementation will likely have a more complex state
machine than this, such as allowing other communications to progress
(e.g., other non-blocking communication, unexpected receives, etc.)
during significant blocking periods (e.g., the wait_for_ack()
function could block for an arbitrarily long time). But you get the
idea.
|