When the grass grows in parallel, do they know about each other? Do they coordinate? Perhaps they have some kind of collective intelligence? Do they use MPI?
In previous editions of this column, we've talked about the 6 basic functions of MPI, how MPI_INIT and MPI_FINALIZE actually work, and discussed in agonizing detail the differences between MPI ranks, MPI processes, and CPU processors. Armed with this knowledge, you can write large, sophisticated parallel programs. So what's next?
Collective communication is the next logical step: MPI's native ability to involve a group of MPI processes together in a single communication, possibly involving some intermediate computation.
MPI Collective Basic Concepts
Many parallel algorithms include the concept of a collective operation - an operation in which multiple processes participate in order to compute a result. A global sum is an easy example to discuss - each process contributes an integer that is summed in an atomic fashion and the final result is made available (perhaps just to a single "root" process, or perhaps made available to all participating processes).
A brief recap: MPI defines all point-to-point communications in terms of "communicators." Communicators are fixed sets of ordered processes with a unique context. Communication that occurs on a communicator is guaranteed to not collide with communications occurring on other communicators.
MPI also defines collective communication in terms of communicators. All collective operations explicitly involve every process in a communicator. Specifically: a collective will not be complete until all processes in the communicator have participated. Due to the nature of some of MPI's pre-defined collective operations (see the sidebar "Will That Collective Block?"), this may or may not imply blocking behavior. There is one exception to this rule: MPI_BARRIER is guaranteed not to return until all processes in the communicator have entered the barrier.
There are two main kinds of collectives defined in MPI: rooted and non-rooted. "Rooted" operations have a single process acting as the explicit originator or receiver of data. For example, MPI_BCAST broadcasts a buffer from a root process to all other processes in the communicator; MPI_GATHER gathers buffers from each process in the communicator into a single, combined buffer in the root process. "Non-rooted" operations are those where there is either no explicit originator/receiver or all processes are sending/receiving data. MPI_BARRIER, for example, has no explicit senders/receivers, while MPI_ALLGATHER both performs a gather operation from all processes in the communicator and makes the result available to all processes.
Barrier Synchronization
Sidebar: Why Not Use MPI_SEND and MPI_RECV? |
An obvious question that arises is: why bother? Why not simply use a linear loop over MPI_SEND and MPI_RECV to effect the same kinds of operations? In short: it's all about optimization. The MPI built-in collectives usually offer the following advantages:
- Tuned algorithms: the implementation can choose among algorithms (linear, logarithmic, and so on) based on message size and communicator size.
- Platform exploitation: the implementation can take advantage of shared memory, network topology, or hardware-level collective support.
- Less code: there is less for the application to write, debug, and maintain, and the application automatically benefits as the MPI implementation improves.
The moral of the story: it is generally safer to trust your MPI implementation's collective algorithms than to implement your own. While no MPI implementation is perfect, most modern versions do a reasonable job of optimizing collective operations. |
One of the simplest collective operations to describe is the barrier synchronization. MPI's function for this is MPI_BARRIER. It takes a single significant argument: a communicator.
One should note that, while the argument lists of the MPI C, C++, and Fortran bindings for a given function are typically similar in terms of "significant" arguments, there are some minor differences. One notable difference is that all MPI Fortran calls take a final "ierr" argument that the C and C++ bindings do not. The "ierr" argument is used for passing errors back to the caller (errors are handled differently in the C and C++ bindings).
As described above, MPI_BARRIER does not return until all processes in the communicator have entered the barrier. The seemingly-simple barrier operation is a good example illustrating that a variety of different algorithms can be used:
- linear: the root receives from all processes followed by the root sending to all processes
- logarithmic: a binomial tree gather to the root followed by a binomial tree scatter from the root
- 2-level latency split algorithm: a local gather, global gather, global scatter, and finally a local scatter
- N-level latency split algorithm: similar to the above, but for N levels, not 2
- shared memory: each process increments a shared counter; when the counter equals the number of processes, exit the barrier
The barrier operation has been researched for years, particularly in the area of shared memory algorithms (the naive shared memory algorithm listed above will typically provide dismal performance). Many other algorithms are possible; the above list shows just a few of them.
There's no good reason for a user application to include implementations for all of these algorithms; the MPI implementation should provide some form of an optimized barrier (which may be one or more of the above algorithms) so that the user application does not have to worry about such issues.
Be wary of overusing MPI_BARRIER. It is frequently tempting to insert barriers for ease of control and simplicity of code. However, barriers are usually unnecessary when writing MPI programs; MPI's tag/communicator matching rules for point-to-point communication and the "fence" operation for one-sided operations (to be described in a later column) typically obviate the need for barriers. Indeed, a barrier executed in every iteration of a repetitive code can introduce artificial performance limitations.