Connect / Accept
The SPAWN functions are used for creating new MPI
jobs. But what about existing (potentially unrelated) MPI
jobs that want to establish communication between each other?
Taking inspiration from the TCP socket connect / accept model, the MPI
functions MPI_COMM_CONNECT and MPI_COMM_ACCEPT can
be used to emulate client-server functionality. Specifically, a
"server" process can invoke MPI_COMM_ACCEPT and wait for a
"client" process to invoke MPI_COMM_CONNECT to connect to
it. This sequence allows two independent MPI jobs to establish
communication between each other. Similar to the SPAWN
functions, the output of CONNECT and ACCEPT is an
intercommunicator.
Sidebar: The Problem With Schedulers
Dynamic processes present many problems for MPI implementers, the most
notable of which is what to do in a scheduled environment. Most MPI
users have become accustomed to reserving enough nodes / CPUs for
their initial parallel job. For example, consider a scheduled cluster
where a user receives an allocation of four CPUs and launches a
four-process MPI job (i.e., an MPI_COMM_WORLD size of four). If
this MPI application invokes MPI_COMM_SPAWN to launch eight
more processes, where should these processes be invoked?
- Oversubscribe the nodes: launch the eight new processes in the
current allocation. This method is possible (and easy), but most HPC
applications will not want this because it will likely lead to
performance degradation, since multiple processes will be timesharing
each CPU.
- Launch on new nodes: this can only occur by obtaining new nodes /
CPUs from the scheduler. This action will most likely mean putting the
resource request at the end of the scheduler's queue, and may involve
a lengthy wait (minutes, hours, or even days). MPI_COMM_SPAWN blocks
for that entire time, potentially wasting much of the current
allocation.
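To put a number on the first option, here is a minimal sketch that assumes processes are spread evenly over the allocated CPUs. The function name procs_per_cpu is made up for illustration and is not part of MPI.

```c
/*
 * Worst-case number of processes sharing one CPU when spawned processes
 * are placed in the original allocation (assumes an even spread;
 * the result is rounded up).
 */
int procs_per_cpu(int allocated_cpus, int initial_procs, int spawned_procs)
{
    int total = initial_procs + spawned_procs;
    return (total + allocated_cpus - 1) / allocated_cpus;
}
```

With the numbers from the example (four CPUs, four initial processes, eight spawned), procs_per_cpu(4, 4, 8) returns 3, so every CPU ends up timesharing three processes.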
Hence, this is still very much an open question for MPI
implementers. Indeed, some vendor MPI implementations have not
implemented the MPI-2 dynamic functionality precisely because they are
typically used in production scheduled environments where the focus is
to keep the computational resource full: there will never be free
resources to SPAWN new jobs on without waiting in the
scheduler's queue.
In the TCP model, IP addresses and port numbers are used to specify
the destination of a connect attempt. In MPI, such identifiers are not
meaningful, since some MPI implementations do not even support
TCP. Analogous to an (IP address, TCP port) tuple, MPI uses the
somewhat confusingly-named concept of "ports" as connection endpoints
(which have nothing to do with TCP ports).
A server process opens a port with a call
to MPI_OPEN_PORT, which returns the name of the port
in a dynamic, implementation-dependent string. The server passes this
port name to MPI_COMM_ACCEPT to create a connection endpoint, and the
client uses the same string in its call to MPI_COMM_CONNECT. However,
this is a chicken-and-egg problem: how could the client know the
server's port name unless the two already have some established form
of communication?
MPI provides a port name lookup service: the server publishes its port
name under a well-known string (e.g., "server") with a call
to MPI_PUBLISH_NAME. The client
invokes MPI_LOOKUP_NAME with the well-known string ("server")
and obtains the server's port name, which is then used to
call MPI_COMM_CONNECT.
This publish / lookup system is analogous to how DNS is used to
translate human-readable names to IP addresses. For example, when a
user types www.yahoo.com into a browser, the browser performs
a DNS query to resolve this to an IP address that can be used to
connect to the server. Replace www.yahoo.com with "server" and
"IP address" with "MPI port name," and the same description covers
the MPI case.
Join
The third and final dynamic process model in MPI-2
is MPI_COMM_JOIN. Two processes
invoke MPI_COMM_JOIN, each with one end of a common TCP
socket (MPI does not specify how this socket was created - it is the
application's responsibility). MPI can use the socket for startup
negotiation in order to establish its own communication
channel(s). Upon return from MPI_COMM_JOIN, the TCP socket is
drained (but still open) and an intercommunicator containing the two
processes (each in its own group) is returned.
Although this model may seem unnatural, having the application
establish the initial communication channel is valuable in that it
allows the use of an external connection mechanism (i.e., the
socket). This feature effectively provides an "escape" from the MPI
run-time environment and allows a potentially much wider range of
connectivity than is natively supported by the MPI implementation -
anywhere that the application can connect a socket.
Keep in mind that there is no guarantee that the MPI implementation
will be able to establish an intercommunicator with the process on the
remote end of the socket. For example, some MPI implementations are
geared towards operating system bypass networks; if there is no common
OS-bypass network between the two processes, the join may fail. Other
problematic scenarios may include intermediate firewalls or other
limited connectivity between peer processes.
Where to Go From Here?
We've covered the background and the basics of MPI-2 dynamic
processes. Next column, we'll provide some meaningful examples of why
and how they can be useful in HPC applications, especially when paired
with multi-threaded scenarios.
Got any MPI questions you want answered? Wondering why one MPI
does this and another does that? Send
them to the MPI Monkey.
Resources
- MPI Forum (MPI-1 and MPI-2 specification documents): http://www.mpi-forum.org/
- MPI - The Complete Reference: Volume 1, The MPI Core (2nd ed.), by Marc Snir, Steve Otto, Steven Huss-Lederman, David Walker, and Jack Dongarra. The MIT Press. ISBN 0-262-69215-5.
- MPI - The Complete Reference: Volume 2, The MPI Extensions, by William Gropp, Steven Huss-Lederman, Andrew Lumsdaine, Ewing Lusk, Bill Nitzberg, William Saphir, and Marc Snir. The MIT Press. ISBN 0-262-57123-4.
- NCSA MPI tutorial: http://webct.ncsa.uiuc.edu:8900/public/MPI/
This article was originally published in ClusterWorld Magazine. It
has been updated and formatted for the web. If you want to read more
about HPC clusters and Linux, you may wish to visit
Linux Magazine.
Jeff Squyres is leading up Cisco's Open MPI efforts as part of
the Server Virtualization Business Unit.