MPI: Zen and the Art of MPI Collectives


Broadcast

Another simple collective operation to describe is the broadcast: data is sent from one process to all other processes in a communicator. Its function prototype is similar to MPI_SEND; it takes a buffer, count, MPI datatype, and communicator - just like MPI_SEND. But rather than requiring a destination rank and tag, MPI_BCAST accepts a root rank specifying which process contains the source buffer. Listing 1 shows a simple program using MPI_BCAST.

Listing 1: Simple broadcast MPI program
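#include <stdio.h>
#include <mpi.h>

/* Assumes MPI_Init has already been called. */
void simple_broadcast(void) {
    int rank, value = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        /* Only the root reads the value from standard input. */
        printf("Enter a value: ");
        fflush(stdout);
        scanf("%d", &value);
    }
    /* Every process makes the same call; rank 0 is the root. */
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);
    printf("Rank %d has value: %d\n", rank, value);
}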

MPI_COMM_WORLD rank 0 will prompt for an integer and then broadcast it to all other processes. Note that all processes call MPI_BCAST in exactly the same way; the same parameters are used in each process. At the root (MPI_COMM_WORLD rank 0), the value variable is used as an input buffer; value is used as an output buffer in all other processes. After MPI_BCAST returns, all processes have the same value in value.

Reduction Operations

Another common type of collective operation is the reduction. Pre-defined and user-defined operations can be applied to data as it is combined to form a single answer. Listing 2 shows a simple program that computes a global sum.

Listing 2: Simple reduction MPI program
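#include <stdio.h>
#include <mpi.h>

/* Assumes MPI_Init has already been called. */
void simple_reduction(void) {
    int rank, sum = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    /* Each process contributes its rank; the total lands on rank 0. */
    MPI_Reduce(&rank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("Sum of rank values: %d\n", sum);
}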

MPI_SUM is a predefined operation that computes the sum of the input buffers provided by all processes. The resulting sum is placed in the output buffer, sum. Note that just like MPI_BCAST, all processes execute the same collective function - but only the root (MPI_COMM_WORLD rank 0) receives the resulting sum value. On all other processes, the value of the sum variable is unmodified by MPI.

Sidebar: Will that collective block?
As mentioned earlier in the column, the only collective that is guaranteed to block is MPI_BARRIER. All other collectives are defined to block only until their portion of the collective is complete. In some cases, depending on how the particular collective algorithm is implemented, a process may return immediately. In other cases, processes may block, but for varying amounts of time.

Consider MPI_GATHER - an operation where every process sends its buffer to the root. As soon as each process sends its buffer, it can return. In this scenario, the return from MPI_GATHER on non-root ranks does not imply anything about the completion of MPI_GATHER on any other process in the communicator. The only thing that is known is that the root process will be the last one to complete.

The function MPI_ALLREDUCE operates in the same way as MPI_REDUCE except that all processes receive the answer, not just the root. You can think of it as an MPI_REDUCE immediately followed by an MPI_BCAST (although, for optimization reasons, it may not be implemented that way).

MPI has several other pre-defined operations, including (but not limited to): maximum, minimum, product, logical and bitwise AND, OR, and XOR, and maximum/minimum location (essentially for finding the process rank with the maximum/minimum value).

Other Collective Operations

MPI has other collective operations that are worth investigating, such as: scatter, gather, all-to-all, and both inclusive and exclusive scan. Some of these operations have multiple variants; for example, there is both a rooted gather (where one process receives all the data) and an "allgather" (where all processes receive all the data).

These operations are described in detail in the MPI-1 and MPI-2 standards documents.

Where To Go From Here?

The short version of the column is: MPI collectives are your friends. Use them. Don't code up your own collective algorithms unless you really need to. If the collectives in your MPI implementation perform poorly, write to your Congressman.

Communicators were mentioned frequently this month; next month, we'll discuss them in detail along with their partner in crime: MPI groups.


Resources
MPI Forum (MPI-1 and MPI-2 specification documents): http://www.mpi-forum.org
MPI - The Complete Reference: Volume 1, The MPI Core (2nd ed) (The MIT Press). By Marc Snir, Steve Otto, Steven Huss-Lederman, David Walker, and Jack Dongarra. ISBN 0-262-69215-5.
MPI - The Complete Reference: Volume 2, The MPI Extensions (The MIT Press). By William Gropp, Steven Huss-Lederman, Andrew Lumsdaine, Ewing Lusk, Bill Nitzberg, William Saphir, and Marc Snir. ISBN 0-262-57123-4.
NCSA MPI tutorial: http://webct.ncsa.uiuc.edu:8900/public/MPI/

This article was originally published in ClusterWorld Magazine. It has been updated and formatted for the web. If you want to read more about HPC clusters and Linux, you may wish to visit Linux Magazine.

Jeff Squyres is the Assistant Director for High Performance Computing for the Open Systems Laboratory at Indiana University and is one of the lead technical architects of the Open MPI project.

    Creative Commons License
    ©2005-2012 Copyright Seagrove LLC, Some rights reserved. Except where otherwise noted, this site is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License. The Cluster Monkey Logo and Monkey Character are Trademarks of Seagrove LLC.