MPI: Why Are There So Many MPI Implementations?

User Threads

A fundamental decision that an MPI implementation needs to make early in its development is whether to allow multiple user threads and, if so, whether to support concurrency within the MPI library. It is fundamentally easier for an MPI implementation to assume that only one user thread will be in the library at any given time, either by allowing only single-threaded MPI applications or by using a single, global mutex to protect all entry points to the library - effectively allowing only one thread into the library at a time.
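The single-global-mutex approach can be sketched with a toy model. This is Python standing in for the library internals, not MPI code; `ToyLibrary`, `send`, `queued`, and `run_demo` are hypothetical names invented for illustration. Every entry point takes the same lock, so the count of threads inside the library never exceeds one.

```python
import threading


class ToyLibrary:
    """Toy model of a library guarded by one global lock (not a real MPI binding)."""

    def __init__(self):
        self._lock = threading.Lock()  # the single, global mutex
        self._inside = 0               # threads currently inside the library
        self.max_inside = 0            # high-water mark of _inside
        self._queue = []               # internal state, only touched under _lock

    def _enter(self):
        self._lock.acquire()
        self._inside += 1
        self.max_inside = max(self.max_inside, self._inside)

    def _leave(self):
        self._inside -= 1
        self._lock.release()

    def send(self, dest, payload):
        """Entry point: serialized behind the global lock."""
        self._enter()
        try:
            self._queue.append((dest, payload))
        finally:
            self._leave()

    def queued(self):
        """Entry point: also serialized behind the same lock."""
        self._enter()
        try:
            return len(self._queue)
        finally:
            self._leave()


def run_demo(num_threads=4, sends_per_thread=100):
    """Hammer the library from several threads; return (messages queued, max threads inside)."""
    lib = ToyLibrary()

    def worker(rank):
        for i in range(sends_per_thread):
            lib.send(dest=rank, payload=i)

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return lib.queued(), lib.max_inside
```

The design is simple and safe, but it buys that safety by giving up all concurrency inside the library: a thread sending to one peer waits for a thread sending to an unrelated peer.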

When multiple, concurrent user threads are allowed, some form of locking must be used in the MPI library to protect internal data structures yet (presumably) still allow fine-grained concurrency. For example, it is desirable to allow multiple threads executing MPI_SEND to progress more or less independently. Note that this may not be possible if both sends are going to the same destination (or otherwise must share the same network channel) or if the threads are running on the same CPU. But in general, the goal of allowing multiple user threads within the MPI library is to offer a high degree of concurrency wherever possible.
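One way to picture fine-grained locking is a lock per network channel rather than a lock per library. The sketch below is a toy Python model under that assumption; `Channel`, `FineGrainedLibrary`, and `run_demo` are hypothetical names, and real implementations protect many more structures than a channel table. Sends to different destinations take different locks, while sends to the same destination still serialize on that channel's lock.

```python
import threading


class Channel:
    """One network channel to a single destination, with its own lock."""

    def __init__(self):
        self.lock = threading.Lock()
        self.next_seq = 0
        self.wire = []  # (sequence number, payload), in the order sent


class FineGrainedLibrary:
    """Toy model: per-channel locks instead of one global lock."""

    def __init__(self):
        self._table_lock = threading.Lock()  # guards only the channel table
        self._channels = {}

    def _channel_for(self, dest):
        # Short critical section: look up or create the channel, nothing else.
        with self._table_lock:
            channel = self._channels.get(dest)
            if channel is None:
                channel = self._channels[dest] = Channel()
            return channel

    def send(self, dest, payload):
        channel = self._channel_for(dest)
        # Only senders targeting the same destination contend for this lock.
        with channel.lock:
            channel.wire.append((channel.next_seq, payload))
            channel.next_seq += 1

    def wire_for(self, dest):
        channel = self._channel_for(dest)
        with channel.lock:
            return list(channel.wire)


def run_demo(sends_per_thread=50):
    """Two threads send to destination 0 and two to destination 1."""
    lib = FineGrainedLibrary()

    def worker(dest):
        for i in range(sends_per_thread):
            lib.send(dest, i)

    threads = [threading.Thread(target=worker, args=(d,)) for d in (0, 0, 1, 1)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return lib
```

The sketch shows the lock structure only. In CPython the interpreter lock keeps these threads from running bytecode in parallel, so it demonstrates which operations contend with each other, not an actual speedup.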

Unless this is considered during the initial design, it is difficult (if not impossible) to graft a fine-grained locking system onto the MPI implementation's internal progression engine(s). This issue is not really specific to MPI, however; it is a general design-for-threads issue.

Progress: Asynchronous or Polling?

Many MPI implementations only make progress on pending message passing operations when an MPI function is invoked. For example, even if an application started a non-blocking send with MPI_ISEND, the message may not be fully sent until MPI_TEST and/or MPI_WAIT is invoked. This behavior is common for single-threaded MPI implementations (although it is a different issue than allowing multiple simultaneous application-level threads in the MPI library).
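A toy model makes the polling behavior concrete. The sketch below is Python, not MPI; `PollingLibrary`, `Request`, and the chunk-at-a-time progress rule are assumptions chosen for illustration. The non-blocking send only enqueues work, and bytes reach the "wire" only when the application calls back into the library.

```python
class Request:
    """Handle for one pending non-blocking send."""

    def __init__(self, data):
        self.remaining = list(data)
        self.done = not self.remaining


class PollingLibrary:
    """Toy model: progress happens only inside library calls."""

    def __init__(self, chunk_size=4):
        self.chunk_size = chunk_size
        self.pending = []
        self.wire = []  # everything that has actually been "sent"

    def isend(self, data):
        # Returns immediately; nothing is put on the wire here.
        req = Request(data)
        self.pending.append(req)
        return req

    def _progress(self):
        # Push at most one chunk of each pending request onto the wire.
        for req in self.pending:
            chunk = req.remaining[:self.chunk_size]
            del req.remaining[:self.chunk_size]
            self.wire.extend(chunk)
            if not req.remaining:
                req.done = True
        self.pending = [r for r in self.pending if not r.done]

    def test(self, req):
        # One round of progress, then report whether the request completed.
        self._progress()
        return req.done

    def wait(self, req):
        # Keep driving progress until the request completes.
        while not req.done:
            self._progress()
```

With `chunk_size=4`, a 10-element `isend` leaves the wire empty, one `test` call moves 4 elements, and `wait` moves the rest. An application that computes for a long time between `isend` and `wait` gets no overlap from a library built this way.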

Other MPI implementations offer true asynchronous progress, potentially utilizing specialized communication hardware or extra, hidden threads in the library that can make message passing progress regardless of what the application's threads are doing.
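The hidden-thread variant can be sketched the same way. Again this is a toy Python model with hypothetical names (`AsyncLibrary`, `AsyncRequest`, `finalize`), and it assumes a single progress thread draining a work queue; it says nothing about how any particular MPI implementation or network hardware does it. The point is that the request completes without the application calling back into the library.

```python
import queue
import threading


class AsyncRequest:
    """Handle for one pending send; `completed` is set by the progress thread."""

    def __init__(self, data):
        self.data = list(data)
        self.completed = threading.Event()


class AsyncLibrary:
    """Toy model: a hidden thread makes progress independently of the application."""

    def __init__(self):
        self.wire = []  # written only by the progress thread
        self._work = queue.Queue()
        self._thread = threading.Thread(target=self._progress_loop, daemon=True)
        self._thread.start()

    def isend(self, data):
        # Hand the request to the progress thread and return immediately.
        req = AsyncRequest(data)
        self._work.put(req)
        return req

    def _progress_loop(self):
        while True:
            req = self._work.get()
            if req is None:  # shutdown sentinel from finalize()
                break
            self.wire.extend(req.data)
            req.completed.set()

    def finalize(self):
        # Stop the progress thread and wait for it to exit.
        self._work.put(None)
        self._thread.join()
```

A caller might do `req = lib.isend(data)`, carry on with unrelated computation, and later check `req.completed.is_set()`. Note that even this small model needs a thread-safe queue and an event, which hints at why the threading issues from the previous section reappear here.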

Asynchronous progress really needs to be designed in from the start. Either specific hardware needs to be used or many of the same issues that arise with multiple application threads need to be addressed. It is therefore difficult (if not impossible) to add true asynchronous support to a polling-only MPI implementation.

Sidebar: The Penalty of Fortran

Most MPI implementations are written in C and/or C++. In addition to C and C++ bindings, the MPI standard specifies language bindings in two flavors of Fortran: one that will work with Fortran 77 (and later) compilers and one that will work with Fortran 90 (and later) compilers.

For MPI implementations that provide them, the Fortran bindings are typically "wrapper" functions, meaning that they are actually written in C (or C++) and simply translate the Fortran arguments to C/C++ conventions before invoking a back-end implementation function. In many cases, the back-end function is the corresponding C function. For example, the Fortran binding for MPI_SEND performs argument translation and then invokes the C binding for MPI_SEND.

The argument translation may also involve some lookups - for example, converting Fortran integer handles into back-end structures or objects. In a threaded environment, this likely involves some form of locking.
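The handle lookup can be sketched as a small table that maps integer handles to back-end objects. The model below is Python rather than the C a real wrapper would be written in, and every name in it (`HandleTable`, `Communicator`, `backend_send`, `fortran_send_wrapper`) is hypothetical. It assumes handles are simply indices into a list and that one lock guards the table.

```python
import threading


class Communicator:
    """Stand-in for a back-end communicator object."""

    def __init__(self, name):
        self.name = name


class HandleTable:
    """Maps Fortran-style integer handles to back-end objects."""

    def __init__(self):
        self._lock = threading.Lock()  # the table is shared by all threads
        self._objects = []

    def register(self, obj):
        # Store the object and hand back its integer handle.
        with self._lock:
            self._objects.append(obj)
            return len(self._objects) - 1

    def lookup(self, handle):
        # Every wrapper call pays for this lock and this lookup.
        with self._lock:
            if not 0 <= handle < len(self._objects):
                raise ValueError(f"invalid handle: {handle}")
            return self._objects[handle]


def backend_send(buf, count, comm):
    """Stand-in for the back-end send; it just reports what it would send."""
    return (comm.name, list(buf[:count]))


def fortran_send_wrapper(table, buf, count, comm_handle):
    """Translate the integer handle, then call the back-end function."""
    comm = table.lookup(comm_handle)
    return backend_send(buf, count, comm)
```

For example, after `world = table.register(Communicator("world"))`, a call to `fortran_send_wrapper(table, [1, 2, 3, 4], 3, world)` resolves the handle and forwards the first three elements. The extra lock acquisition and lookup on each call is the kind of per-call overhead the sidebar title alludes to.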

Not all implementations work this way, but many do. It is worth investigating your MPI implementation's behavior if you are trying to squeeze every picosecond of performance out of your parallel environment.

Binary [In]Compatibility

Several of the issues discussed above (the types of MPI handles, the contents of MPI_Status, and the values of constants) can be simplified into a single phrase: have a common mpi.h and mpif.h. If all implementations used the same mpi.h and mpif.h, this would go a long way towards binary compatibility on a single platform.

However, as was recently pointed out to me, that's not really enough. Even though different libmpi.so instances could be used at run-time with a single executable, it is desirable to have a common mpirun as well (and other related MPI command line tools). This requirement means commonality between implementations of MPI_INIT - how to receive the list of processes in MPI_COMM_WORLD, their location, how to wait for or forcibly terminate a set of MPI processes, etc. It also has implications in the implementation of the MPI-2 dynamic process functions (MPI_COMM_SPAWN and friends). This situation translates to a unified run-time environment between MPI implementations.

Given the wide variety of run-time environments used by MPI implementations, this does not seem likely in the near future. Never say "never," of course, but the run-time environment comprises a good percentage of the code in an MPI implementation - it is the back-end soul of the machine. More specifically: given that the MPI interface is standardized, there is at least a hope of someday specifying a common mpi.h and mpif.h. But the run-time environment in an MPI implementation is not specified in the MPI standard at all, and there is little to no similarity between the run-time systems of different implementations. As such, merging them into a single, common system seems unlikely.

Where to Go From Here?

Yes, Virginia, MPI implementations are extremely complicated. Although binary compatibility is unlikely, source code compatibility has been and always will be available. This feature is part of the strength of MPI. Another part is the unrelenting desire of developers to optimize the heck out of their MPI implementations. Take comfort that your code will not only run everywhere, it will likely run well everywhere.

Got any MPI questions you want answered? Wondering why one MPI does this and another does that? Send them to the MPI Monkey.

Resources
MPI Forum (MPI-1 and MPI-2 specification documents) http://www.mpi-forum.org/
MPI - The Complete Reference: Volume 1, The MPI Core (2nd ed) (The MIT Press) By Marc Snir, Steve Otto, Steven Huss-Lederman, David Walker, and Jack Dongarra. ISBN 0-262-69215-5.
MPI - The Complete Reference: Volume 2, The MPI Extensions (The MIT Press) By William Gropp, Steven Huss-Lederman, Andrew Lumsdaine, Ewing Lusk, Bill Nitzberg, William Saphir, and Marc Snir. ISBN 0-262-57123-4.
NCSA MPI tutorial http://webct.ncsa.uiuc.edu:8900/public/MPI/

This article was originally published in ClusterWorld Magazine. It has been updated and formatted for the web. If you want to read more about HPC clusters and Linux, you may wish to visit Linux Magazine.

Jeff Squyres is leading up Cisco's Open MPI efforts as part of the Server Virtualization Business Unit.

    Creative Commons License
    ©2005-2012 Copyright Seagrove LLC, Some rights reserved. Except where otherwise noted, this site is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License. The Cluster Monkey Logo and Monkey Character are Trademarks of Seagrove LLC.