MPI: The Spawn of MPI

Connect / Accept

The SPAWN functions are used for creating new MPI jobs. But what about existing (potentially unrelated) MPI jobs that want to establish communication between each other?

Taking inspiration from the TCP socket connect / accept model, the MPI functions MPI_COMM_CONNECT and MPI_COMM_ACCEPT can be used to emulate client-server functionality. Specifically, a "server" process can invoke MPI_COMM_ACCEPT and wait for a "client" process to invoke MPI_COMM_CONNECT to connect to it. This sequence allows two independent MPI jobs to establish communication between each other. Similar to the SPAWN functions, the output of CONNECT and ACCEPT is an intercommunicator.

Sidebar: The Problem With Schedulers

Dynamic processes present many problems for MPI implementers, the most notable of which is what to do in a scheduled environment. Most MPI users have become accustomed to reserving enough nodes / CPUs for their initial parallel job. For example, consider a scheduled cluster where a user receives an allocation of four CPUs and launches a four process MPI job (i.e., a MPI_COMM_WORLD size of four). If this MPI application invokes MPI_COMM_SPAWN to launch eight more processes, where should these processes be invoked?

  • Oversubscribe the nodes: launch the eight new processes the current allocation. This method is possible (and easy), but most HPC applications will not want this because it will likely lead to performance degradation, since multiple processes will be timesharing each CPU.
  • Launch on new nodes: this can only occur by obtaining new nodes / CPUs from the scheduler. This action will most likely mean putting the resource request at the end of the scheduler's queue, and may involve a lengthy wait (minutes, hours, or even days). The result is a blocked MPI_COMM_SPAWN for the entire time, potentially wasting a lot of time in the current allocation.

Hence, this is still very much an open question for MPI implementers. Indeed, some vendor MPI implementations have not implemented the MPI-2 dynamic functionality only because they are typically used in production scheduled environments where the focus is to keep the computational resource full - there will never be free resources to SPAWN new jobs on without waiting in the scheduler's queue.

In the TCP model, IP addresses and port numbers are used to specify the destination of a connect attempt. In MPI, such distinctions are meaningless - some MPI implementations do not even support TCP. Analogous to an (IP address, TCP port) tuple, MPI uses the somewhat confusingly-named concept of "ports" as connection endpoints (which have nothing to do with TCP ports).

A server process opens a port with a call to MPI_OPEN_PORT. This port is passed to MPI_COMM_ACCEPT to create a connection endpoint. MPI_OPEN_PORT will also return the name of the port in a dynamic, implementation-dependent string that can be used by the client in its call to MPI_COMM_CONNECT. However, this is a chicken-and-egg problem - how could the client know the server's port name unless they already have some established form of communication?

MPI provides a port name lookup service: the server publishes its port name under a well-known string (e.g., "server") with a call to MPI_PUBLISH_NAME. The client invokes MPI_LOOKUP_NAME with the well-known string ("server") and obtains the server's port name, which is then used to call MPI_COMM_CONNECT.

This publish / lookup system is analogous to how DNS is used to translate human-readable names to IP addresses. For example, when a user types www.yahoo.com into a browser, the browser performs a DNS query to resolve this to an IP address that can be used to connect to the server. Change www.yahoo.com to "server," and "IP address" to "[MPI] port name," and the above example is probably much clearer.

Join

The third and final dynamic process model in MPI-2 is MPI_COMM_JOIN. Two processes invoke MPI_COMM_JOIN, each with one end of a common TCP socket (MPI does not specify how this socket was created - it is the application's responsibility). MPI can use the socket for startup negotiation in order to establish its own communication channel(s). Upon return from MPI_COMM_JOIN, the TCP socket is drained (but still open) and an intercommunicator containing the two processes (each in their own group) is returned.

Although this model may seem unnatural, having the application establish the initial communication channel is valuable in that it allows the use of an external connection mechanism (i.e., the socket). This feature effectively provides an "escape" from the MPI run-time environment and allows a potentially much wider range of connectivity than is natively supported by the MPI implementation - anywhere that the application can connect a socket.

Keep in mind that there is no guarantee that the MPI implementation will be able to establish an intercommunicator with the process on the remote end of the socket. For example, some MPI implementations are geared towards operating system bypass networks; if there is no common OS-bypass network between the two processes, the join may fail. Other problematic scenarios may include intermediate firewalls or other limited connectivity between peer processes.

Where to Go From Here?

We've covered the background and the basics of MPI-2 dynamic processes. Next column, we'll provide some meaningful examples of why and how they can be useful in HPC applications, especially when paired with multi-threaded scenarios.

Got any MPI questions you want answered? Wondering why one MPI does this and another does that? Send them to the MPI Monkey. {mosgoogle right}

Resources
MPI Forum (MPI-1 and MPI-2 specifications documents) http://www.mpi-forum.org/
MPI - The Complete Reference: Volume 1, The MPI Core (2nd ed) (The MIT Press) By Marc Snir, Steve Otto, Steven Huss-Lederman, David Walker, and Jack Dongarra. ISBN 0-262-69215-5
MPI - The Complete Reference: Volume 2, The MPI Extensions (The MIT Press) By William Gropp, Steven Huss-Lederman, Andrew Lumsdaine, Ewing Lusk, Bill Nitzberg, William Saphir, and Marc Snir. ISBN 0-262-57123-4.
NCSA MPI tutorial http://webct.ncsa.uiuc.edu:8900/public/MPI/

This article was originally published in ClusterWorld Magazine. It has been updated and formatted for the web. If you want to read more about HPC clusters and Linux, you may wish to visit Linux Magazine.

Jeff Squyres is leading up Cisco's Open MPI efforts as part of the Server Virtualization Business Unit.

    Search

    Feedburner

    Login Form

    Share The Bananas


    Creative Commons License
    ©2005-2012 Copyright Seagrove LLC, Some rights reserved. Except where otherwise noted, this site is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License. The Cluster Monkey Logo and Monkey Character are Trademarks of Seagrove LLC.