Memory-Checking Debuggers
Memory-checking debuggers are the greatest thing since sliced
bread. Once you start using memory-checking debuggers, you'll wonder
how you programmed without them. In addition to identifying all the
"normal" causes of crashing that regular debuggers catch
(segmentation faults, bus errors, etc.), memory-checking debuggers
look for erroneous patterns such as accessing memory outside of an
array or the local stack, using heap memory that was already freed,
freeing memory that was already freed, using uninitialized variables,
and so on. Best of all, they will tell you these things by file and
line number in your source code.
Popular memory-checking debuggers include Valgrind (Linux), bcheck
(Solaris, part of the Forte compiler suite), Rational Purify (a
commercial product available for several operating systems), and
various forms of "malloc debug" (e.g., OS X has native
support). Others are also available.
Most memory-checking debuggers are intended to be used
non-interactively and cannot be attached to already-running
processes. As such, depending on your MPI implementation, they may
have to be launched via mpirun. For example, to run Valgrind in parallel:
$ mpirun -np 2 valgrind --num-callers=100 \
--tool=memcheck --leak-check=yes \
--show-reachable=yes --log-file=output my_mpi_app
This command will run two copies of Valgrind, which will, in turn,
each launch a copy of my_mpi_app. Each Valgrind instance will
monitor its child process and send its output to a
file named output.pid[PID]. After the application completes,
the output files can be examined to see the errors that Valgrind
detected.
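With one log file per process, a reasonable first check is the ERROR
SUMMARY line that Valgrind's Memcheck tool writes at the end of each log.
The following is a minimal sketch of a helper for that, not something
shipped with Valgrind or any MPI implementation. The function name
summarize_valgrind_logs is hypothetical, and the sketch assumes the usual
summary format "==PID== ERROR SUMMARY: N errors from M contexts".

```shell
#!/bin/sh
# Hypothetical helper (not part of Valgrind): print one line per log file
# with the error count taken from Memcheck's "ERROR SUMMARY" line.
# Assumed line format: ==PID== ERROR SUMMARY: N errors from M contexts ...
summarize_valgrind_logs() {
    for log in "$@"; do
        if [ ! -r "$log" ]; then
            echo "$log: unreadable"
            continue
        fi
        # The count is the 4th whitespace-separated field of the summary
        # line; if the line appears more than once, keep the last one.
        count=$(awk '/ERROR SUMMARY:/ { n = $4 } END { print n }' "$log")
        if [ -z "$count" ]; then
            # No summary usually means the process died before Valgrind
            # could finish writing its report.
            echo "$log: no ERROR SUMMARY found"
        else
            echo "$log: $count errors"
        fi
    done
}
```

For the run above, this might be invoked as "summarize_valgrind_logs
output.pid*" from the directory holding the logs. Newer Valgrind releases
expand %p in the --log-file argument to the process ID (for example,
--log-file=output.%p), so the log file names on your system may differ.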
Sidebar:
Should I Compile My MPI with -g?
|
Many users ask me if they need to compile their MPI implementation
with "-g" to enable them to debug their MPI application.
The answer is no. The application being debugged must be compiled with
"-g", but the MPI implementation itself does not need to be. Without
"-g" in the MPI library, the debugger cannot step into MPI
functions. Hence, if you try to "step" into MPI_SEND, the
debugger won't let you and will likely execute the
entire MPI_SEND function call. This result is what most users
want anyway - you're attempting to debug your application, not the MPI
implementation.
If you compile your MPI implementation with "-g", you'll be able to
step into functions such as MPI_SEND, but this may not
provide as much useful information as you would think - the internals
of an MPI implementation are quite complex.
Note that this sidebar really only applies to MPI implementations that
provide their source code; binary-only MPI implementation
distributions are most likely compiled without "-g" (and, by
definition, the debugger will not be able to find the source code to
display).
|
Note that LAM/MPI should be configured with the "--with-purify" switch
to be used with memory-checking debuggers. LAM uses some optimizations
that are known to be safe, but that tools such as Valgrind interpret
as reading from uninitialized memory. This switch eliminates many of
those false positives at the expense of a slight performance loss.
Although memory-checking debuggers cannot catch all errors,
they can help find a lot of errors even before you know that
they exist (even for serial applications). My own experience
has shown that it can be extremely helpful to use memory-checking
debuggers frequently during an application's development - even when
you are not aware of any current bugs.
Parallel Debuggers
Finally, there are debuggers specifically created to operate on
parallel MPI applications. Three commercial suites are the Distributed
Debugging Tool (DDT) from Allinea, Fx2 from Absoft, and Totalview from
Etnus. These packages have a significant advantage over the prior
approaches: they natively understand an entire parallel
job. Specifically, in addition to all the normal functionality of
a debugger (setting breakpoints, examining variables, stepping through
code, etc.), you can individually monitor and control all processes in
a running MPI job.
For example, you can step through the code in process A (while
blocking all other processes) and watch a message being sent. Then you
can step through process B and watch the message being received. In
this manner, you have complete control over the entire parallel job.
These kinds of tools are invaluable for serious parallel application
development, but tend to be somewhat expensive. If you can afford
them, parallel debuggers are extremely helpful tools.
Where to Go From Here?
The moral of this column is: fear not. There really is more to
parallel debugging than printf, Virginia. Debugging is a
tricky task, but using the proper tools can reduce
it to something manageable.
Next column, we'll continue the debugging discussion and describe some
common MPI programming errors and how you can use the techniques
described here to find them.
Got any MPI questions you want answered? Wondering why one MPI
does this and another does that? Send
them to the MPI Monkey.
Resources
Allinea DDT
Absoft Fx2
LAM/MPI FAQ (more information on debugging in parallel)
MPI Forum (MPI-1 and MPI-2 specifications documents)
MPI - The Complete Reference: Volume 1, The MPI Core (2nd ed) (The MIT Press) By Marc Snir, Steve Otto, Steven Huss-Lederman, David Walker, and Jack Dongarra. ISBN 0-262-69215-5.
MPI - The Complete Reference: Volume 2, The MPI Extensions (The MIT Press) By William Gropp, Steven Huss-Lederman, Andrew Lumsdaine, Ewing Lusk, Bill Nitzberg, William Saphir, and Marc Snir. ISBN 0-262-57123-4.
NCSA MPI tutorial
IBM Purify
Etnus Totalview
Valgrind Project
This article was originally published in ClusterWorld Magazine. It
has been updated and formatted for the web. If you want to read more
about HPC clusters and Linux, you may wish to visit
Linux Magazine.
Jeff Squyres is leading up Cisco's Open MPI efforts as part of
the Server Virtualization Business Unit.