6: Blaming MPI for Programmer Errors
A natural tendency when an application breaks is to blame the MPI implementation, particularly when your application "works" with one MPI implementation and (for example) seg faults with another. While no MPI implementation is perfect, they do typically go through heavy testing before release. It is quite possible (and likely) that your application actually has a latent bug that is simply not tripped on some architectures or MPI implementations.
This sounds arrogant (especially coming from an MPI implementer), but the vast majority of "bug reports" that we receive are actually due to errors in the user's application (and sometimes they are very subtle errors). For example, some compilers initialize variables to default values (such as zero). Others do not. If your code accidentally depends on a variable having a default value, it may work fine under some platforms / compilers, yet cause errors on others.
Before submitting a bug report to the maintainers, double and triple check your application. Use a memory-checking debugger, such as the Linux Valgrind package, the Solaris bcheck command-line checker, or the Purify system. All of these debuggers will report on the memory usage in your application, including buffer overflows, reading from uninitialized memory, and so on. You'd be surprised what will turn up in your application.
Where to Go From Here?
So what did we learn here?
- Ensure your environment is set up correctly. You only need to do this once.
- Always check non-blocking communication for completion. Don't leak resources.
- Avoid MPI_PROBE and MPI_IPROBE; they're evil.
- Ensure that you are using the right compilers.
- Don't blame MPI for your errors. Use memory-checking debuggers.
If anything, realize that you are not alone if you run into MPI problems. The problems discussed in this column are all relatively easy to fix. So even if you can't get your MPI application to run - don't despair. The solution is probably just a few Google searches or a system administrator away.
Stay tuned - next column, we'll continue the list with my Top 5, All Time Favorite Evils to Avoid in Parallel.
| Resource | Details |
|---|---|
| MPI Forum (MPI-1 and MPI-2 specifications documents) | http://www.mpi-forum.org/ |
| MPI - The Complete Reference: Volume 1, The MPI Core (2nd ed) (The MIT Press) | By Marc Snir, Steve Otto, Steven Huss-Lederman, David Walker, and Jack Dongarra. ISBN 0-262-69215-5 |
| MPI - The Complete Reference: Volume 2, The MPI Extensions (The MIT Press) | By William Gropp, Steven Huss-Lederman, Andrew Lumsdaine, Ewing Lusk, Bill Nitzberg, William Saphir, and Marc Snir. ISBN 0-262-57123-4 |
| NCSA MPI tutorial | http://webct.ncsa.uiuc.edu:8900/public/MPI/ |
| The Tao of Programming | By Geoffrey James. ISBN 0931137071 |
This article was originally published in ClusterWorld Magazine. It has been updated and formatted for the web. If you want to read more about HPC clusters and Linux, you may wish to visit Linux Magazine.
Jeff Squyres is the Assistant Director for High Performance Computing for the Open Systems Laboratory at Indiana University and is one of the lead technical architects of the Open MPI project.