Answers to this and other questions that keep you up at night.
We are the MPI. You will be assimilated. Your code and technological distinctiveness will be added to our own. Resistance is futile. Your code will run everywhere... won't it?
The Story So Far
In each of these columns, I am careful to distinguish between the MPI standards specification and the behavior of a given MPI implementation. There are many MPI implementations available - some vendors even have more than one. But why? Wasn't the goal of MPI to simplify all of this and make it easy to have portable parallel processing applications? I have personally seen clusters with over twenty different MPI implementations installed - it was each user's responsibility to determine which one they should use for their application (and set their PATH and other environment factors properly). This scenario is unfortunately not uncommon.
Indeed, with the myriad of different implementations available, independent software vendors (ISVs) attempting to sell closed-source parallel applications that use MPI typically face considerable logistical QA challenges. They already have to certify their application across a large number of hardware and operating system combinations; add a third dimension of MPI implementations, and the total number of platforms to certify against is multiplied yet again.
But Aren't MPI Applications Portable?
To be fair, the MPI Forum's goal was to enable source code portability, allowing users to recompile the same source code on different platforms with different MPI implementations. Even though some aspects of the MPI standard are not provided by all MPI implementations, MPI applications are largely source code portable across a wide variety of systems. Indeed, application source code portability is one of the largest contributing factors to the success of MPI.
Binary portability - the ability to run the same executable on multiple platforms (a la Java applets) or the ability to run the same executable with different MPI implementations on the same platform - was not one of the MPI Forum's original goals. As such, the MPI standard makes no effort to standardize the values of constants, the types of C handles, and several other surface-level aspects that make an MPI implementation distinct.
Since MPI-2 was published, proposals for binary MPI interoperability (such as between the open source MPI implementations) have been introduced periodically. Although these proposals have never succeeded, it has not been because the implementers think that this is a Bad Idea - reducing the logistical burden on users and ISVs is definitely a Good Thing™. They have failed because each MPI implementation has made fundamental design choices that preclude this kind of binary interoperability. More on this below.
Note that the goal of source code portability says nothing about performance portability - the potential for unmodified applications to run with the same performance characteristics under multiple MPI implementations. Previous editions of this column have discussed the hazards of making implicit assumptions about your MPI implementation (e.g., whether MPI_SEND will block or not).
But the basic questions remain: why are there so many MPI implementations? And why are they so different?
To answer these questions, one really needs to look at what an MPI implementation has to provide to adhere to the standard, and then what the goals of that particular implementation are.
The Letter of the Law
As has been mentioned many times in this column, the MPI standard - consisting of two documents: MPI-1 and MPI-2 - is the bible to an MPI implementer. An implementation must adhere to all of the standard's definitions, semantics, and API details in order to be conformant.
At its core, an MPI implementation is about message passing - the seemingly simple act of moving bytes from one process to another. However, there are a large number of other services and data structures that accompany this core functionality. The MPI specification contains over 300 API functions and tens of pre-defined constants. Each of these API functions has specific, defined behavior (frequently related to other API functions) that an implementation must obey.
The data structures required to support such a complex web of interactions are, themselves, complex. Open MPI's internal communicator structure, for example, contains 24 members (17 of which either contain or point to other structures). The creation and run-time maintenance of these structures is an intricate task, requiring careful coding and painstaking debugging.
The Spirit of the Law
Even with the MPI standard, there are many places - both deliberate and [unfortunately] unintentional - where the text is ambiguous, and an MPI developer has to make a choice in the implementation. Should MPI_SEND block or return immediately? Should a given message be sent eagerly or use a rendezvous protocol? Should progress occur on an asynchronous or polling basis? Are user threads supported? Are errors handled? And if so, how?
And so on - the list is endless.
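To make one of these choices concrete, here is a minimal sketch of how an implementation might pick between eager and rendezvous sends, and what that implies for whether a standard-mode send returns before a matching receive is posted. Everything in it is an assumption made for illustration: the names `choose_protocol` and `standard_send_outcome`, the two enums, and the 4096-byte `EAGER_LIMIT_BYTES` threshold are hypothetical and are not taken from any real MPI implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical protocol and outcome names, for illustration only. */
typedef enum { PROTO_EAGER, PROTO_RENDEZVOUS } send_protocol;
typedef enum { SEND_RETURNS, SEND_BLOCKS } send_outcome;

/* Assumed tunable: messages at or below this size are sent eagerly.
 * Real implementations choose their own limits, often per network. */
#define EAGER_LIMIT_BYTES ((size_t)4096)

/* Small messages are pushed to the receiver immediately; large ones
 * wait for a handshake so the receiver does not have to buffer them. */
send_protocol choose_protocol(size_t message_bytes)
{
    return message_bytes <= EAGER_LIMIT_BYTES ? PROTO_EAGER
                                              : PROTO_RENDEZVOUS;
}

/* Models whether a standard-mode send can return right away.
 * receive_posted is 1 if the matching receive already exists, else 0. */
send_outcome standard_send_outcome(size_t message_bytes, int receive_posted)
{
    assert(receive_posted == 0 || receive_posted == 1);

    if (choose_protocol(message_bytes) == PROTO_EAGER) {
        /* The message is buffered on the receiving side. */
        return SEND_RETURNS;
    }
    /* Rendezvous: the sender must wait for the receiver. */
    return receive_posted ? SEND_RETURNS : SEND_BLOCKS;
}
```

Under these assumptions, a 1 KB send returns even when no receive has been posted, while a 1 MB send does not. An application that silently depends on the first behavior may deadlock under an implementation that uses a smaller limit.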
Each implementer answers these questions differently, largely depending on the goals of the specific implementation. Some MPI implementations are "research quality" and were created to study a specific set of experimental issues. Such implementations are likely to take short cuts in many areas and concentrate on their particular research topic(s). Other implementations are hardened/production quality, and must be able to run large parallel jobs for weeks at a time without leaking resources or crashing.
Some implementations are targeted at specific platforms, interconnects, run-time systems, etc., while others are designed to be portable across many combinations of platform, network, and run-time system. In some ways, writing a single-purpose MPI implementation (e.g., for a specific set of hardware/network/run-time system) can be dramatically simpler than writing a portable one. Since a single-purpose implementation only has to work on one operating system, with one compiler, and one network, its code is far less complex than that of a portable system.
That being said, I've had discussions with developers of such single-system implementations and, despite the homogeneity of their target systems, their job is not easy. I've known developers who cheerfully break out logic analyzers to watch bus activity during an MPI run so that they can fully understand all activity on the machine and further optimize their MPI. I even know of one [unnamed] vendor's implementation that used self-modifying code to avoid two cache misses and reduce latency by a few tens of nanoseconds. That particular trick had to get sign-offs from several levels of management to pass QA but, in the end, contributed to delivering an extremely high-performing MPI to the company's customers.
Let's take a short tour of some other choices that an MPI implementer has to make.
MPI Handles: Pointers or Integers?
This may seem like a trivial matter, but it has wide-reaching effects throughout the entire MPI implementation. A communicator, for example, has a bunch of internal data associated with it (the members of the group, the error-handler associated with it, whether the communicator is an inter- or intra-communicator, and so on). An implementation typically bundles all this information together in a C structure (or C++ object) and provides the application with some kind of handle to it. The handle that the application sees is of type MPI_Comm - but what should its real type be: a pointer to the structure/object, or an integer index into an array of all currently-allocated communicators?
Surprisingly, this issue provokes deep religious rifts between MPI implementers.
Using integers for handles means that there is no loss of performance between the C and Fortran bindings - both sets use indirect addressing to find the back-end structure (note that MPI specifically defines Fortran handles to be integers because Fortran - at least Fortran 77 - has no concept of a pointer). Note, however, that in multi-threaded environments, it is necessary to obtain a lock before examining the array because another thread may have grown (and therefore moved) the array.
Conversely, using pointers means that the Fortran bindings may have to translate from the integer to a pointer (probably through indirect addressing), but the C bindings can access the back-end data directly and have no need for additional lookup or locking of index arrays. Finally, on platforms where a Fortran INTEGER is the same size as a pointer, this is a non-issue - the two can be used interchangeably (e.g., the Fortran integer handle can actually be the C pointer value). This is not true for all platforms, however.
The size of MPI handles is visible in mpi.h, and is therefore a key aspect of the MPI implementation's interface to user applications.
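The two handle designs can be sketched side by side. This is a simplified illustration, not code from any real implementation: the `comm_record` structure, its two fields, the fixed-size `comm_table`, and all function names are hypothetical. The table never grows here, which sidesteps the locking issue described above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical back-end communicator record; fields are illustrative. */
typedef struct {
    int size;      /* number of processes in the group */
    int is_inter;  /* nonzero for an inter-communicator */
} comm_record;

/* Fixed-size table of records (assumption: it never grows or moves). */
#define MAX_COMMS 16
static comm_record comm_table[MAX_COMMS];
static int comm_table_used = 0;

/* Design A: the handle is an integer index into the table. */
typedef int int_handle;

int_handle int_handle_create(int size)
{
    assert(comm_table_used < MAX_COMMS);
    comm_table[comm_table_used].size = size;
    comm_table[comm_table_used].is_inter = 0;
    return comm_table_used++;
}

int int_handle_size(int_handle h)
{
    assert(h >= 0 && h < comm_table_used);
    return comm_table[h].size;  /* extra indirection through the table */
}

/* Design B: the handle is a pointer straight to the record. */
typedef comm_record *ptr_handle;

int ptr_handle_size(ptr_handle h)
{
    assert(h != NULL);
    return h->size;             /* direct access, no table lookup */
}

/* With design B, the Fortran bindings still receive an integer and
 * must translate it to a pointer before calling the C back end. */
ptr_handle fortran_to_ptr_handle(int fortran_handle)
{
    assert(fortran_handle >= 0 && fortran_handle < comm_table_used);
    return &comm_table[fortran_handle];
}
```

An application compiled against design A passes an `int` wherever a communicator is expected, while one compiled against design B passes a pointer. The two executables cannot share a library even though the source code is identical.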
What's in an MPI_Status?
The MPI_Status object, as defined by the MPI standard, is different from all other MPI objects: not only does it have public data members, but the user is also responsible for allocating and freeing MPI_Status objects. This requirement means that its structure must be defined in mpi.h - including any internal data members (so that pointer math in the application can be accurate).
Although the standard forbids MPI applications from using the internal data members, the fact that MPI_Status is accessed by value (and not through a handle) means that its size is a key aspect of the MPI implementation's interface to user applications.
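A small sketch shows why the size matters. The three public members, MPI_SOURCE, MPI_TAG, and MPI_ERROR, are the ones the standard requires; everything else below is assumed for illustration, including the two layouts `status_impl_a` and `status_impl_b`, their internal members, and the helper `status_array_bytes`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical layout for one implementation: three public members
 * plus one internal member that applications must not touch. */
typedef struct {
    int MPI_SOURCE;
    int MPI_TAG;
    int MPI_ERROR;
    int internal_count;       /* hypothetical internal field */
} status_impl_a;

/* Hypothetical layout for a second implementation with different
 * internal bookkeeping, and therefore a different size. */
typedef struct {
    int MPI_SOURCE;
    int MPI_TAG;
    int MPI_ERROR;
    size_t internal_count;    /* hypothetical internal field */
    int internal_cancelled;   /* hypothetical internal field */
} status_impl_b;

/* Bytes occupied by an array of n statuses of the given element size.
 * The application computes this with the layout it was compiled
 * against; the library steps through the array using its own layout. */
size_t status_array_bytes(size_t element_size, size_t n)
{
    assert(element_size > 0);
    return element_size * n;
}
```

If an application allocates an array of statuses using the first layout and hands it to a library built around the second, the library writes past the end of the array, even though both layouts expose the same three public members.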