Does MPI mean Many Possible Interconnects?
Have you ever searched for "MPI" on EBay? It's fantastic! You can buy cars, pens, deuterium lamps, PVC piping, power transformers - all with MPI! I offer this as proof positive to the message passing naysayers: with MPI integrated into all of these common, real-world products, MPI is here to stay!
Pardon me; I have to go place some bids...
The Story So Far
Last column, we talked about some of the issues involved with a single-threaded MPI implementation making "progress" on message passing. Specifically, the discussion centered around TCP sockets-based implementations. Unavoidable sources of overhead were discussed, such as MPI envelopes and operating system TCP handling.
In this edition, we'll discuss operating system-bypass types of networks such as Myrinet, Infiniband, and Quadrics. Although we won't cover many specifics of these three networks, the message passing issues are fairly similar across all three - and quite different than TCP-based networks.
A new MPI community resource has recently opened its doors: a wiki all about MPI - http://www.mpi-comm-world.org/. The intent is to provide an MPI support site for the community, by the community. Any information about MPI is fair game - information about the standard, questions (and answers) about specific MPI implementations, etc. You are hereby cordially invited to submit information that you wish you had known when you started with MPI.
There is a password on the site to prevent spam defacements: "MPI_COMM_WORLD" (without the quotes). Enjoy!
What Are OS-Bypass Networks?
One source of the overhead involved in sending and receiving data from a network is the time necessary to trap into the operating system kernel. In some cases, this trap can occur multiple times, further increasing overhead. OS-bypass networks attempt to minimize or eliminate the costly trap into the kernel - bypassing the operating system and using communication co-processors located on the network card to handle the work instead of the main CPU.
Ignoring lots of details, this typically involves either modifications to the kernel or (more recently) loading a module into the kernel to provide both underlying support for user-level message passing and a device driver to talk to the network card. In the user's application (typically within the MPI library), message sending occurs by placing information about the outgoing message in a control structure and then notifying the communication co-processor to start the send. Incoming messages are automatically received by the co-processor and a control structure is placed in the user application's memory, where it can be found when the application polls for progress.
Differences from TCP
The design of OS-bypass networks make their usage quite different than TCP-based networks. Here are some of the differences:
- Messages are transferred as single units; there are no partial sends and receives.
- Message and the ordering between them may not be guaranteed.
- All sent messages must have a corresponding pre-posted receive (no message can be unexpected).
- Resources are limited; e.g., "special" memory may have to be used for all message passing.
Let's look at each of these in detail.
Unlike the other three, this characteristic generally makes message passing easierthan TCP. One of the most annoying aspects of writing a TCP-based progression engine is the fact that it has to maintain state about who "owns" the socket, how far along a read or write is in a given buffer, etc. Each pass through the progression engine has to survey all these values, progress them, and then update the internal accounting information. It's not rocket science, but it is quite cumbersome and annoying.
Newer generation networks present interfaces that transfer messages as atomic units. If you send N bytes, the receiver will receive N bytes - no more, no less. That's a whole lot of logic that does not need to be included in the MPI transfer layer (as compared to its TCP analogue).
TCP guarantees that any bytes sent will be received - there is never any need for retransmission. Other networks do not necessarily provide this guarantee. Packets can be dropped or otherwise corrupted (solar and atmospheric activity, believe it or not, can actually be a factor at high altitudes!); the MPI layer is typically the entity that has to watch for - and correct - these kinds of problems. This requirement translates into at least some additional amount of code and error conditions that the MPI layer has to handle.
Dropped packets have to be handled by the MPI implementation. Some networks' default handling of lost packets is to drop allpackets to a given peer. The MPI layer must then effectively replay all the messages that it sent to that peer - potentially in the same order that they were originally sent (depending on how the MPI implementation's wire protocols work) - to recover.
Data corruption monitoring, although it seems like a desirable trait, is not something that most users will tolerate in an MPI implementation. Because [potentially expensive] data integrity checking needs to be inserted in the critical code path of message passing, this step, by definition, increases message passing latency. Combined with the fact that data corruption is rare, few MPI implementations support data integrity and simply assume that the underlying network will always deliver correct data.
No Unexpected Messages
Most low latency networks derive at least some of their speed from the fact that they can assume that all messages are expected - that there is a corresponding user-provided buffer ready to receive each incoming message (unexpected messages are errors, and can lead to dropped packets). The MPI implementation therefore needs to pre-post receive buffers (for envelopes at least), potentially from each of its peers.
Some networks can only use "special" memory for all message passing. Myrinet, Infiniband, and older versions of Quadrics need to use memory that has been registered with the network driver. One of the key actions that registering memory accomplishes is that the operating system "pins" down the virtual page(s) where the memory is located and guarantees that it(they) will never be swapped out or moved elsewhere in physical memory. This restriction allows the communication co-processor on the network card to DMA transfer to and from the user's buffer without fear of the operating system moving it during the process. Operating systems have limits on how much memory can be pinned; typically anywhere from 1/4 to 1/2 of physical memory. The MPI implementation is usually responsible for managing registered memory.
However, since all receives must be pre-posted, the MPI layer must setup to receive some number of envelopes (possibly on a per-peer basis). Two obvious questions that come from this:
- How many envelopes should be pre-posted?
- How large should the envelopes be?
Pre-posting the buffers consumes system resources (e.g., registered memory), but posting too few buffers for envelopes can degrade performance. If too few are posted, flow control issues can consume too much processing power and cause "dead air" time on the network; if too many are posted, the system can run out of registered memory.
The size of the envelopes is another factor. MPI implementations commonly have three message sizes: tiny, small, and large. Tiny messages are included in the payload of the envelope itself, and can therefore be sent eagerly in a single network transfer. Small messages are also sent eagerly, but in a second network transfer (i.e., immediately after the envelope). Large messages are sent via rendezvous protocols.
Hence, the size of the envelope is really the maximumsize of the envelope. Specifically, the MPI implementation will transfer only as much of the envelope as is necessary. For tiny messages, normally the header and the payload are sent; for short and long messages, only the header is sent. Choosing the maximum sizes for tiny and short messages is therefore a complicated choice (and will not be covered in this column).
Increasing the maximum size of envelopes can increase performance for applications that send relatively small messages. That is, if the user application's common message size is N, setting the maximum envelope size to be greater than or equal to N means that the application's messages will be sent eagerly with a single message transfer (subject to flow control, of course). But this also consumes system resources and can potentially exhaust available registered memory.
- Next >>