Progress is good, so they say

My mother called me the other day, tremendously concerned. "Are you going to get sued?" she demanded. "Are you going to have to pay a fine, or go to jail?"

"Err... Mom, I have no idea what you're talking about."

"That 'work' you do. I heard on the radio today that the recording industry is suing hundreds of people for illegally downloading music on the Internet."

After a long conversation detailing the difference between MP3 files and MPI, my mom is still convinced that I'm going to jail.

(my [real] mother insists that I provide the following disclaimer: the above story is entirely fictional; she's mad as hell and isn't going to take it anymore)

The Story So Far

Last column, we talked about some of the fundamental architectural decisions that an MPI implementation has to make during its design phases. This column, we'll take a closer look at some point-to-point message passing issues and implementation details. Performance can vary wildly across platforms and networks, not just because of the quality of the MPI implementation (although this is a frequently overlooked issue) but also because of how progress is made in the underlying MPI engine, which is frequently dictated by how the underlying network(s) allow messages to be sent and received.

The inner workings of how an MPI implementation performs the seemingly simple task of moving bytes from one process to another may surprise you. More importantly, having at least a basic understanding of what your MPI is doing may enable you to write your application to match its underlying semantics, and therefore squeeze just a little bit more performance out of your environment.

Sidebar: Common MPI envelope data
MPI implementations typically have to include many of the following common data items when messages are sent across a network. Sometimes the data must be explicitly included in the message; sometimes the network transport itself directly or indirectly implies several of the values:
  • Envelope type: If the MPI implementation uses different types (and sizes) of envelopes for different situations, an identifier must be included to indicate what kind of envelope is being used.
  • Communicator ID: Almost always required (it may be unnecessary if the communicator can be identified some other way, such as by using a different TCP socket for each communicator, but this may be tremendously inefficient in terms of resources).
  • Tag: Almost always required (for the same reasons as the communicator ID).
  • Originator: Some networks can naturally identify the sender of an envelope (e.g., TCP), but others cannot (e.g., gm/Myrinet). If the network cannot provide the information, the originator of the envelope must be included in the envelope.
  • Message length: Even if the network or operating system can provide this, it is frequently more efficient if the MPI implementation knows how many bytes to receive in the payload.
  • Flags: Typically a bit field indicating information such as whether the envelope contains an ACK, whether the message requires an ACK, etc.
  • Sequence number: Some MPI implementations may send messages out of order (for efficiency, or if messages may take multiple paths to the destination); a sequence number may be required to enforce MPI's message ordering guarantees.
  • Peer request pointer: For messages that require multiple sends (rendezvous protocols, synchronous messages, etc.), including a pointer to the source request can avoid searching in memory for the matching message. For example, if a sender includes a pointer to its MPI request entry in the envelope, and that address is echoed back in the ACK from the receiver, the sender knows exactly which request has been acknowledged.
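
To make the peer request pointer idea concrete, here is a minimal sketch in C. Everything in it is hypothetical (the struct request and struct ack_envelope types, their fields, and the helper names are invented for illustration and do not come from any particular MPI implementation). It only shows the sender encoding a request pointer for the wire and using the echoed value to find the request directly, without a search.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sender-side request record (illustrative only). */
struct request {
    int tag;
    int acked;   /* set to 1 once the receiver's ACK arrives */
};

/* Hypothetical ACK envelope: carries back the value the sender supplied. */
struct ack_envelope {
    uint64_t peer_request;
};

/* Sender side: encode a request pointer as a fixed-width wire value. */
uint64_t request_to_wire(struct request *req)
{
    return (uint64_t)(uintptr_t)req;
}

/* Sender side: an ACK arrived; mark the request it names as acknowledged.
 * No lookup is needed because the echoed value is the request's address. */
struct request *handle_ack(const struct ack_envelope *ack)
{
    struct request *req = (struct request *)(uintptr_t)ack->peer_request;
    assert(req != NULL);
    req->acked = 1;
    return req;
}
```

Note that the pointer is only meaningful in the sender's address space; the receiver treats it as an opaque value and simply copies it into the ACK. A real implementation would presumably also want some defense against a corrupted or stale value before dereferencing it.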

Message Envelopes

Keep in mind that the body of the message is not the only content that needs to be sent across the network. A prologue - frequently called an envelope - must also be sent so that the receiver knows what the message is. For example, the envelope usually contains identifiers for the message's communicator and tag. It usually contains a small amount of other information as well, but this varies from implementation to implementation (and potentially even between network types in the same MPI implementation). Regardless, the overall envelope size is kept fairly small because enlarging it increases the latency of small messages - consider that even a 1-byte user message must be accompanied by a corresponding envelope.

Some MPI implementations send different types (and sizes) of envelopes depending on the message; Sidebar 1 shows a list of common elements that may be included in envelopes. An implementation's goal is typically to send the shortest envelope possible to reduce the latency of "simple" messages, such as short, non-synchronous messages.
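
As a rough illustration of what such an envelope might look like in C, consider the sketch below. The field names, widths, ordering, and flag values are all assumptions made for this example; real implementations choose their own layouts and often use several envelope types. The envelope_matches() helper is likewise hypothetical and ignores wildcards for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bits; real implementations choose their own. */
#define ENV_FLAG_IS_ACK    0x01u  /* this envelope is itself an ACK */
#define ENV_FLAG_NEEDS_ACK 0x02u  /* sender wants an ACK (e.g., synchronous send) */

/* Hypothetical envelope: every field name and width here is an assumption. */
struct envelope {
    uint8_t  type;          /* which kind of envelope this is */
    uint8_t  flags;         /* bit field of ENV_FLAG_* values */
    uint16_t comm_id;       /* communicator ID */
    int32_t  tag;           /* MPI tag */
    int32_t  source;        /* originator rank, if the network cannot supply it */
    uint32_t seq;           /* sequence number, if ordering must be enforced */
    uint64_t length;        /* payload length in bytes */
    uint64_t peer_request;  /* sender's request pointer, echoed back in an ACK */
};

/* Does an incoming envelope match a posted receive? Wildcards are omitted. */
bool envelope_matches(const struct envelope *env,
                      uint16_t comm_id, int32_t source, int32_t tag)
{
    assert(env != NULL);
    return env->comm_id == comm_id &&
           env->source == source &&
           env->tag == tag;
}
```

With these particular widths the fields add up to 32 bytes, so a 1-byte user message would travel with at least 32 bytes of envelope. That is the kind of overhead that motivates sending a shorter envelope type for "simple" messages.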

MPI Progress

"Progress" is typically defined as the advancement of pending communications - completing sends and/or receives that were initiated previously. This includes both blocking and non-blocking communications. For simplicity, we'll focus on single-threaded MPI implementations here; the issues facing multi-threaded implementations are best described as "analogous" - they're similar in spirit, but the deep, dark details are different.

What exactly is an MPI implementation doing while it is "making progress"? How do the bytes get from process A to process B?

The answer is inevitably different in each MPI implementation, but let's look at a few common examples to see what is happening under the covers in a typical TCP sockets-based implementation (next column, we'll delve into non-sockets-based implementations). These examples are likely not exactly how any one MPI implementation works, but they are influenced by several real-world MPI progress engines.


Most sockets-based implementations open at most one socket for point-to-point MPI messages between pairs of processes; all traffic is multiplexed across the socket.

If no other communications are pending, blocking sends are [deceptively] easy - the writev(2) system call can send both the envelope and payload in a single function call (and potentially in a single TCP packet). The operating system may or may not block... but who cares? MPI is allowed to block, so that point is irrelevant. Also recall that when write() and writev() return, we have no guarantees from the operating system that the message has actually been received (or even sent!). All we know is that the buffer is available for re-use, and therefore MPI can return from the blocking send.
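
A minimal sketch of such a blocking send in C might look like the following. The struct envelope layout and the function names are assumptions made for this example, not any implementation's actual code. The one wrinkle it adds is a retry loop: writev(2) may return a short count (for example, if interrupted by a signal after writing part of the data), so the helper keeps going until both the envelope and the payload have been handed to the operating system.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical minimal envelope (illustrative only). */
struct envelope {
    int32_t  comm_id;
    int32_t  tag;
    uint64_t length;   /* payload length in bytes */
};

/* Write everything described by iov, retrying on short writes and EINTR.
 * Modifies the iov array as it goes. Returns 0 on success, -1 on error. */
static int writev_all(int fd, struct iovec *iov, int iovcnt)
{
    while (iovcnt > 0) {
        ssize_t n = writev(fd, iov, iovcnt);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        size_t left = (size_t)n;
        /* Skip over entries that were written completely. */
        while (iovcnt > 0 && left >= iov->iov_len) {
            left -= iov->iov_len;
            iov++;
            iovcnt--;
        }
        /* Advance into a partially written entry, if any. */
        if (iovcnt > 0) {
            iov->iov_base = (char *)iov->iov_base + left;
            iov->iov_len -= left;
        }
    }
    return 0;
}

/* Blocking eager send: envelope and payload go out in one writev() call
 * when the operating system accepts them all at once. */
int blocking_send_eager(int fd, int32_t comm_id, int32_t tag,
                        const void *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    struct envelope env = { comm_id, tag, (uint64_t)len };
    struct iovec iov[2];
    iov[0].iov_base = &env;
    iov[0].iov_len = sizeof(env);
    iov[1].iov_base = (void *)buf;   /* writev() does not modify the data */
    iov[1].iov_len = len;
    return writev_all(fd, iov, 2);
}
```

When blocking_send_eager() returns 0, the user's buffer can be reused, which is all a blocking send has to promise. It says nothing about whether the receiver has seen the message.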

Similarly, blocking receives can be implemented with calls to read(2). With either eager or rendezvous protocols, the algorithm is simple because MPI is allowed to block in writev(2) and read(2). The following simplistic pseudocode outlines the sending side of both cases:

Listing 1: Blocking sockets code
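/* Pseudocode: iov[0] describes the envelope (env), iov[1] the payload */
if (size <= eager_size) {
    /* Eager: send the envelope and payload together */
    fill_envelope(&env, iov);
    writev(fd, iov, 2);
} else {
    /* Rendezvous: send only the envelope, then wait for the receiver's ACK */
    fill_envelope(&env, iov);
    write(fd, &env, sizeof(env));
    wait_for_ack(fd);
    /* The receiver is ready: send the envelope and payload */
    writev(fd, iov, 2);
}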

A real MPI implementation will likely have a more complex state machine than this, such as allowing other communications to progress (e.g., other non-blocking communication, unexpected receives, etc.) during significant blocking periods (e.g., the wait_for_ack() function could block for an arbitrarily long time). But you get the idea.
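
For the receiving side of the eager case, a minimal C sketch might look like the following. It assumes a hypothetical struct envelope and, more importantly, assumes that the next envelope on the socket is the one this receive is waiting for; a real implementation has to match envelopes against posted receives and queue unexpected messages. The function names are invented for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical minimal envelope (illustrative only). */
struct envelope {
    int32_t  comm_id;
    int32_t  tag;
    uint64_t length;   /* payload length in bytes */
};

/* Read exactly len bytes, retrying on short reads and EINTR.
 * Returns 0 on success, -1 on error or if the peer closed the socket. */
static int read_all(int fd, void *buf, size_t len)
{
    char *p = buf;
    while (len > 0) {
        ssize_t n = read(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (n == 0)
            return -1;   /* end of file: the peer went away */
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Blocking eager receive: read the envelope, then the payload it describes.
 * Returns -1 if the payload would not fit; the payload is then left unread,
 * and a real implementation would report a truncation error instead. */
int blocking_recv_eager(int fd, struct envelope *env,
                        void *buf, size_t capacity)
{
    assert(env != NULL);
    if (read_all(fd, env, sizeof(*env)) != 0)
        return -1;
    if (env->length > capacity)
        return -1;
    return read_all(fd, buf, (size_t)env->length);
}
```

Knowing the payload length from the envelope is what lets the receiver ask read(2) for exactly the right number of bytes rather than guessing where one message ends and the next begins on the multiplexed socket.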

This work is licensed under CC BY-NC-SA 4.0

©2005-2023 Copyright Seagrove LLC, Some rights reserved. Except where otherwise noted, this site is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International. The Cluster Monkey Logo and Monkey Character are Trademarks of Seagrove LLC.