MPI: What Really Happens During MPI_INIT?

Behind the scenes at MPI studios

In the previous two installments, we covered the basics and fundamentals: what MPI is, some simple MPI example programs, and how to compile and run them. For this column, we will detail what happens in MPI_INIT in a simple MPI application (the "ping-pong" example program in Listing 1).

The Story So Far

In the last column, we covered the basics and fundamentals: what MPI is, a simple MPI example program, and how to compile and run it. In this installment, let's dive into a common terminology misconception: processes vs. processors - they're not necessarily related!

In this context, a processor typically refers to a CPU. Typical cluster configurations use uniprocessor or small symmetric multiprocessor (SMP) nodes (e.g., 2-4 CPUs each). Hence, "processor" has a physical - and finite - meaning.

Last column, I said that MPI is described mostly in terms of "MPI processes," where the exact definition of "MPI process" is up to the implementation (it is usually a process or a thread). An MPI application is composed of one or more MPI processes. It is up to the MPI implementation to map MPI processes onto processors.

Threads and (MPI) Processes

Most MPI implementations define an MPI process to be a Windows or POSIX process. Hence, each MPI process has its own global variables and environment, and does not need to be thread-safe. Some MPI implementations, however, do define MPI processes as threads. The Adaptive MPI (AMPI) project from the University of Illinois, for example, uses this model.

Other notable items about MPI, threads, and processes:

  • The MPI standard does not define interactions of MPI processes with non-MPI processes. Specifically, what happens when an MPI process invokes fork(2) is implementation-dependent.
  • Although the MPI-2 document does define the behavior of threads in an MPI process, an MPI implementation may or may not support concurrency in multi-threaded MPI applications.

Mapping MPI Processes to Processors

An implementation may allow you to run M processes on N processors, where M may be less than, equal to, or greater than N. Although maximum performance is typically achieved when each process has its own processor (i.e., when M <= N), there are cases where over-subscribing processors is useful as well (i.e., where M > N).

Table 1 gives a brief description of each possible scenario.

Table 1: Oversubscribing Scenarios

Scenario                          Description
--------------------------------  -----------
Fewer processes than processors   Resources are potentially underutilized, unless additional threads are spawned on unused processors
One process per processor         Resources are fully utilized, potentially running more than one MPI process per node
More processes than processors    Resources are oversubscribed, likely degrading overall performance

When the number of processes is less than or equal to the number of processors, the application can run at its peak performance. Since the total system is either underutilized (there are unused processors) or fully utilized (all processors are being used), the application is not hindered by context switching, cache misses, or virtual memory thrashing caused by other local processes.
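The three rows of Table 1 can be expressed as a small helper. This is only a sketch: the names subscription_t, classify_subscription, and max_processes_per_processor are hypothetical and are not part of any MPI implementation, and the second function assumes that processes are spread evenly across processors, which a real implementation or operating system scheduler may not do.

```c
#include <assert.h>

/* Hypothetical names for illustration only; not part of any MPI API. */
typedef enum {
    UNDERSUBSCRIBED,   /* fewer processes than processors */
    FULLY_SUBSCRIBED,  /* one process per processor */
    OVERSUBSCRIBED     /* more processes than processors */
} subscription_t;

/* Classify a run of M processes on N processors (the rows of Table 1). */
subscription_t classify_subscription(int num_processes, int num_processors)
{
    assert(num_processes > 0 && num_processors > 0);
    if (num_processes < num_processors) {
        return UNDERSUBSCRIBED;
    }
    if (num_processes == num_processors) {
        return FULLY_SUBSCRIBED;
    }
    return OVERSUBSCRIBED;
}

/* Largest number of processes sharing one processor, ASSUMING an even
 * spread: the ceiling of M / N. A value above 1 means some processor
 * must time-slice between processes. */
int max_processes_per_processor(int num_processes, int num_processors)
{
    assert(num_processes > 0 && num_processors > 0);
    return (num_processes + num_processors - 1) / num_processors;
}
```

For example, 8 processes on a uniprocessor are classified as oversubscribed with all 8 sharing the one CPU, while 2 processes on a 4-way SMP node are undersubscribed with at most 1 process per processor.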

The "underutilized" model may also be somewhat misleading. It is not uncommon for an application to use MPI to launch one process per node (and therefore have processors on a node that are not initially used) and spawn computation threads to utilize the additional processors. As such, shared memory/threaded programming techniques are used for on-node coordination and data transfer; MPI is used for off-node message passing. Combined MPI and OpenMP applications use this model, for example.
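The on-node half of this model can be sketched without any MPI calls. The example below uses POSIX threads rather than OpenMP, and the names slice_t, sum_slice, and threaded_sum are hypothetical. It assumes a single process that splits an array across a caller-chosen number of computation threads; in a hybrid application, each MPI process would presumably do something like this locally and then combine the per-node results with message passing.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* One contiguous slice of the input, owned by a single thread. */
typedef struct {
    const double *data;
    size_t begin;     /* first index in the slice */
    size_t end;       /* one past the last index */
    double partial;   /* sum of this slice, written by the thread */
} slice_t;

/* Thread entry point: sum one slice. */
static void *sum_slice(void *arg)
{
    slice_t *s = arg;
    s->partial = 0.0;
    for (size_t i = s->begin; i < s->end; ++i) {
        s->partial += s->data[i];
    }
    return NULL;
}

/* Sum n values using num_threads threads inside one process.
 * Returns 0 on success, -1 if memory or a thread could not be obtained. */
int threaded_sum(const double *data, size_t n, int num_threads, double *result)
{
    assert(num_threads > 0);
    pthread_t *threads = malloc((size_t)num_threads * sizeof *threads);
    slice_t *slices = malloc((size_t)num_threads * sizeof *slices);
    if (threads == NULL || slices == NULL) {
        free(threads);
        free(slices);
        return -1;
    }

    int started = 0;
    int status = 0;
    for (int t = 0; t < num_threads; ++t) {
        slices[t].data = data;
        slices[t].begin = n * (size_t)t / (size_t)num_threads;
        slices[t].end = n * (size_t)(t + 1) / (size_t)num_threads;
        if (pthread_create(&threads[t], NULL, sum_slice, &slices[t]) != 0) {
            status = -1;
            break;
        }
        ++started;
    }

    /* Wait for every thread that was started, then combine partial sums. */
    *result = 0.0;
    for (int t = 0; t < started; ++t) {
        pthread_join(threads[t], NULL);
        *result += slices[t].partial;
    }

    free(threads);
    free(slices);
    return status;
}
```

Each thread writes only to its own slice_t, so no locking is needed; the partial sums are combined after pthread_join. Depending on the platform, the program may need to be linked with -pthread.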

Over-subscribing processors, where more processes are launched than there are physical processors, is typically only used for development and testing, or when access to large parallel resources (such as a production cluster) is limited, expensive, or otherwise constrained. Hence, even though the overall application is almost guaranteed to run with some level of performance degradation, this scenario can be useful to isolate problems, identify performance bottlenecks, or cause artificial race conditions. It can be quite difficult to debug a 4,096 process parallel application; scaling down and running 32 processes (perhaps even on a handful of development workstations, depending on the nature of the application) can make the difference between an impossible-to-locate-and-replicate Heisenbug and an easily-identifiable-and-fixable typo in the code.

It is common to develop and debug parallel applications with a small number of processes (e.g., 2, 4, or 8) on a single workstation. As the application becomes more fully developed and stable, larger testing runs can be conducted on actual clusters to check for scalability and performance bottlenecks.

The Art of Over-subscribing

Most MPI implementations will allow running arbitrary numbers of MPI processes, regardless of the number of available processors. This is somewhat of a black art - if you run too many processes, the processors will thrash, continually trying to give each process its fair share of run time. If you run too few, you may not be able to run meaningful data through your application, or may not trigger error conditions that occur with larger numbers of processes.

For example, running 8 computational and memory-intensive MPI processes on a single uniprocessor workstation will likely result in the machine slowing to a crawl while the cache, virtual paging system, and process schedulers are thrashed beyond reasonable bounds. Conversely, running only 2 lightly-computational, tightly-synchronized processes on a uniprocessor workstation may be acceptable in terms of performance, but may fail to show errors that only occur when running with an odd number of processes.
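As one illustration of an error that only shows up with an odd number of processes, consider a hypothetical pairwise exchange in which ranks 0 and 1 trade data, ranks 2 and 3 trade data, and so on. The pairing scheme and the names NO_PARTNER, naive_partner, and checked_partner are assumptions made for this sketch and do not come from any MPI implementation.

```c
#include <assert.h>

/* Sentinel meaning "this rank has nobody to exchange with". */
#define NO_PARTNER (-1)

/* Naive pairing: 0<->1, 2<->3, ... Flipping the lowest bit of the rank
 * gives the partner, but the result is never checked against the number
 * of processes. */
int naive_partner(int rank)
{
    return rank ^ 1;
}

/* Checked pairing: same scheme, but the last rank of an odd-sized run
 * is told that it has no partner instead of being given an invalid rank. */
int checked_partner(int rank, int size)
{
    assert(size > 0 && rank >= 0 && rank < size);
    int partner = rank ^ 1;
    return (partner < size) ? partner : NO_PARTNER;
}
```

With 2 or 4 processes the two functions agree, so the bug stays hidden. With 5 processes, naive_partner(4) returns 5, which is not a valid rank, while checked_partner(4, 5) returns NO_PARTNER. In a real MPI program the unpaired rank could pass MPI_PROC_NULL as its peer, which turns the corresponding send or receive into a no-op.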

Different MPI Implementations

As mentioned in the last edition of this column, there are many different implementations of MPI available. In this column, we'll go step-by-step in using four different MPI implementations. All examples will use the sample "Hello world" MPI program from our last column (see Listing 1).

Listing 1: Sample "Hello world" MPI application
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
  int rank, size;

  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  printf("Hello, world. I am %d of %d.\n", rank, size);
  MPI_Finalize();
  return 0;
}

The four implementations that we'll focus on are all open source and freely available:

  • FT-MPI v1.0 from the University of Tennessee
  • LA-MPI v1.3.6 from Los Alamos National Laboratory
  • LAM/MPI v7.0.2 from Indiana University
  • MPICH v1.2.5.2 from Argonne National Laboratory

Each implementation has its particular strengths and weaknesses (to be discussed in future columns). Here, we'll focus simply on compiling under each implementation and then running in a few different scenarios. It should be noted that we'll only cover common scenarios in each implementation; consult the extensive documentation and manual pages available with each implementation for more details.

Compiling

Both LAM/MPI and MPICH offer an mpicc "wrapper" compiler for compiling and linking C MPI programs (and corresponding mpif77 and mpiCC wrappers for Fortran and C++ programs), which makes compilation and linking easy (provided your environment variables are set correctly):

$ mpicc hello.c -o hello

FT-MPI also provides wrapper compilers, but they are named ftmpicc and ftmpif77. ftmpicc behaves identically to LAM/MPI's mpicc.

LA-MPI does not provide wrapper compilers; note the following when compiling MPI applications with LA-MPI:

  • A C++ compiler must be used for linking LA-MPI applications.
  • Depending on where LA-MPI was installed, you may need to provide the relevant -I and -L flags.
  • You may need to provide additional linker flags (e.g., -pthread).
  • You need to provide -lmpi to the linker.

This sounds scary; it's not. Most of the time, these details are hidden in a Makefile and are therefore unnoticed by the user. In this example, LA-MPI was installed on a Linux machine with the GNU compilers in /usr/lampi:

$ g++ hello.c -I/usr/lampi/include -L/usr/lampi/lib -pthread -lmpi -o hello.la-mpi

    Creative Commons License
    ©2005-2012 Copyright Seagrove LLC, Some rights reserved. Except where otherwise noted, this site is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License. The Cluster Monkey Logo and Monkey Character are Trademarks of Seagrove LLC.