- Published on Monday, 14 January 2008 19:00
- Written by Jeff Layton
What do you do when you want to strace an MPI code? Ideally you want one strace output file for each MPI process. So how do you use strace in conjunction with mpirun or mpiexec to produce one file per MPI process? One technique I use is to write a main script for the mpirun or mpiexec command. This main script then calls a second script that actually runs the code. The second script is where I put the strace command and the binary. There are a few gotchas that I'll point out along the way. Readers may also be interested in two debugging MPI articles by fellow Cluster Monkey Jeff Squyres: MPI: Debugging -- Can You Hear Me Now? and MPI: Debugging in Parallel (in Parallel).
Let's start with a simple example from the book by Bill Gropp, et al. In Chapter 2 the authors present a simple example of an MPI code where each of N processes writes data to its own file (this is usually referred to as N-N IO). I modified the code to write more data than originally presented.
/* example of parallel Unix write into separate files */
Being the versatile cluster geek that I am, I re-wrote the code in
Fortran for us older folks.
PROGRAM SEQIO
   INCLUDE 'mpif.h'
   INTEGER :: I
   INTEGER :: MYRANK, NUMPROCS, IERROR
   INTEGER :: BUFSIZE
   REAL :: BUF(100000)
   CHARACTER(LEN=12) :: FILENAME
   CHARACTER(LEN=1) :: RANK1
   CHARACTER(LEN=2) :: RANK2
   BUFSIZE = 100000
   IERROR = 0
   CALL MPI_INIT(IERROR)
   CALL MPI_COMM_RANK(MPI_COMM_WORLD, MYRANK, IERROR)
   CALL MPI_COMM_SIZE(MPI_COMM_WORLD, NUMPROCS, IERROR)
   ! Fill the buffer with some data
   DO I = 1, BUFSIZE
      BUF(I) = 2.5 * BUFSIZE + I
   END DO
   ! Build a rank-specific file name: testfile1, testfile2, ...
   IF (MYRANK < 9) THEN
      WRITE(RANK1,'(I1)') MYRANK + 1
      FILENAME = "testfile"//RANK1
   ELSEIF ((MYRANK >= 9) .AND. (MYRANK < 99)) THEN
      WRITE(RANK2,'(I2)') MYRANK + 1
      FILENAME = "testfile"//RANK2
   END IF
   ! Each process writes its buffer to its own file (N-N IO)
   OPEN(UNIT=10, FILE=FILENAME, FORM='UNFORMATTED')
   WRITE(10) BUF
   CLOSE(10)
   CALL MPI_FINALIZE(IERROR)
END PROGRAM SEQIO
Let's spend a little bit of time writing the scripts we need to run the
code and get the strace output. Don't worry if you don't know bash
scripting. I'm not an expert by any stretch, and I usually have to ask
friends for help. But the scripts are simple, and I will
show you the couple of bits of specialized knowledge you need.
I start with a main script that I usually call main.sh that contains all of the setup for the code as well as the command to run the MPI code. For this example, I used MPICH2 for the MPI layer, g95 as the Fortran 90 compiler, and gcc as the C compiler. I won't cover all of the details of how to use MPICH2 since the MPICH2 website covers everything much better than I could. Below is the main script I use.
#!/bin/bash
mpiexec -machinefile ./MACHINEFILE -np 4 /home/laytonj/TESTING/code1.sh

The first line just says to use the bash shell to run the script. The last line is the command to run the MPI code. In this case, it's mpiexec. Notice that what mpiexec actually runs is another script, code1.sh.
Before I talk about the script code1.sh, I want to mention that it's fairly easy to adapt main.sh to a job scheduler such as SGE, Torque, PBS-Pro, or LSF. I don't have the space to talk about how to write job scripts for these schedulers, but it's fairly straightforward and there is documentation on the web. If you get stuck, you can always ask on the Beowulf mailing list.
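For example, a minimal Torque/PBS version of main.sh might look like the sketch below. The directive names are the standard PBS ones, but the job name, node count, and paths are just the hypothetical values used in this article; check your site's scheduler documentation before relying on it.

```shell
#!/bin/bash
#PBS -N strace_test
#PBS -l nodes=4
# PBS starts the job in your home directory; move back to the
# directory the job was submitted from.
cd $PBS_O_WORKDIR
# PBS supplies the list of allocated nodes in $PBS_NODEFILE,
# which takes the place of the hand-made MACHINEFILE.
mpiexec -machinefile $PBS_NODEFILE -np 4 /home/laytonj/TESTING/code1.sh
```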
Let's take a look at the meat of the scripts, the script code1.sh.
#!/bin/bash

/usr/bin/strace -tt -o /tmp/strace.out.$$ /home/laytonj/TESTING/code1 $@

Similar to the main script, this script starts by using the bash shell. The third line is the meat of the script. The first part of the line,
/usr/bin/strace -tt -o /tmp/strace.out.$$

starts the code in the same way that we started the serial example, by using the command strace. As before I used the options -tt to get microsecond timing, and -o to point to an output file. Here's where we need to think about how to write the script so that each MPI process writes to a separate output file.
This is the first bit of bash knowledge that we'll use in our scripts. In the script I have specified the strace output file as,
/tmp/strace.out.$$

So the output files will be located in the /tmp directory on each node used in the run. To keep the files separate, I have added $$ to the end of the file name. In bash scripts, $$ is a special variable that contains the PID (Process ID) of the shell running the script. Since each MPI process is started by its own copy of code1.sh, each one gets a distinct PID, and therefore a distinct file name.
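A quick way to convince yourself of how $$ behaves is a throwaway script like this (the file name is hypothetical):

```shell
#!/bin/bash
# $$ is the PID of the shell running this particular copy of the
# script, so concurrent copies pick different output file names.
echo "this copy would write its trace to /tmp/strace.out.$$"
```

Run it twice and you should see two different numbers at the end of the path.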
The last bit of bash knowledge we need is how to pass command line arguments to our code (if we need them). In this case, we use another predefined bash variable, $@. This allows you to use all of the arguments that were passed to the code1.sh script (arg1, arg2, ...) as arguments to the code itself. To better see how this works, let's look at a simple example to make sure you know how to pass command line arguments to the code in code1.sh.
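Here's a tiny hypothetical script, call it args.sh, that shows what $@ expands to:

```shell
#!/bin/bash
# $@ expands to every argument this script was invoked with, in
# order, so they can be forwarded to the real binary untouched.
echo "forwarding: $@"
```

Running `bash args.sh -r -w -a MPIIO` prints `forwarding: -r -w -a MPIIO`.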
There is an IO benchmark called IOR that has a number of arguments you can pass to the code that describe the details of how to run the benchmark. Here's an example,
IOR -r -w -a MPIIO -b 25m -N 4 -s 25 -t 10m -v -o

Don't worry about what all of the options mean, but let me point out a couple because they can be important for a job scheduler script. The option -N 4 tells the code to use 4 MPI processes. You can change the value of 4 to correspond to what the scheduler defines. Now how do we pass these arguments to the script that actually runs the code?
Sticking with the IOR example the main.sh script looks like,
#!/bin/bash
mpiexec -machinefile ./MACHINEFILE -np 4 /home/laytonj/TESTING/code1.sh \
    -r -w -a MPIIO -b 25m -N 4 -s 25 -t 10m -v -o

Notice how I've taken the command line arguments and put them in the main.sh script. With the $@ bash predefined variable in the code script, the options are passed to the code. The code script doesn't change at all (except for the name of the binary).
#!/bin/bash

/usr/bin/strace -tt -o /tmp/strace.out.$$ /home/laytonj/TESTING/IOR $@

The only thing that changed was the name of the binary from code1 to IOR. So if you want to change the arguments to a code you have to modify the main script. If your code doesn't have any command line arguments, I would recommend just leaving $@ in the code script for future reference.
If you look a little closer at the example scripts for running IOR, it is interesting to note the differences compared to running it without strace. Normally, we would have a single script to run IOR where the command consists of three parts. In order they are,
- mpirun command (command to start MPI code)
- binary of code to run
- arguments to code
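Put together, the normal single-script version (no strace) would be something like the following, using the same hypothetical paths as the rest of the article:

```shell
#!/bin/bash
# All three parts in one script: the mpiexec command, the binary,
# and the arguments to the binary.
mpiexec -machinefile ./MACHINEFILE -np 4 /home/laytonj/TESTING/IOR \
    -r -w -a MPIIO -b 25m -N 4 -s 25 -t 10m -v -o
```

The strace version simply splits this into main.sh and code1.sh so that strace can wrap each MPI process individually.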
When the job is finished you have to go to each node used in the run and copy the files from /tmp back to whatever file system is more permanent than /tmp. You could write all of the strace output files to a central file system, but then you run the risk that two processes on different nodes get the same PID, so their output files would collide. The chances of this are fairly small, but I don't like to take that chance.
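The collection step can be scripted too. The sketch below just demonstrates the renaming idea on one node with made-up stand-in paths; in a real run you would loop over the nodes in MACHINEFILE with ssh/scp and pull the files back to a permanent file system. Tagging each file with the node's hostname means two identical PIDs from different nodes can no longer collide.

```shell
#!/bin/bash
# Demonstration with stand-in files: tag each strace output file
# with the hostname before copying it to a shared results area.
SRC=/tmp/strace-demo          # stands in for /tmp on a compute node
DEST=/tmp/strace-results      # stands in for permanent storage
mkdir -p "$SRC" "$DEST"
touch "$SRC/strace.out.3821"  # fake strace output file for the demo
for F in "$SRC"/strace.out.*; do
    cp "$F" "$DEST/$(basename "$F").$(hostname)"
done
```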
Now that we know how to run our MPI jobs using strace, let's look through a simple example. I'm running the code that I presented earlier. I'm going to run with 4 MPI processes for this article. After I run the code I get four strace.out files.
strace.out.3821  strace.out.3822  strace.out.3823  strace.out.3824

The PIDs are numbered sequentially because I ran all 4 MPI processes on the same machine. Let's look at one of the strace output files.
If you look at the strace file, you will notice that it is much longer
than for the serial case we ran. The reason is that now we're running
an MPI code, so many of the extra function calls are due to MPI doing
its thing in the background (i.e. behind our code). The first
strace output file is listed in Sidebar One at the end of this
article. I've extracted a few of the important lines from the output
and put them below.
15:12:54.920557 access("testfile1", F_OK) = -1 ENOENT (No such file or directory)
15:12:54.920631 access(".", R_OK) = 0
15:12:54.920687 access(".", W_OK) = 0
15:12:54.920748 stat64("testfile1", 0xbfa56800) = -1 ENOENT (No such file or directory)
15:12:54.920816 open("testfile1", O_RDWR|O_CREAT|O_LARGEFILE, 0666) = 7
15:12:54.943471 write(7, "\200\32\6\0@$tH\200$tH\300$tH\0%tH@%tH\200%tH\300%tH"..., 400008) = 400008
15:12:54.945790 ftruncate64(7, 400008) = 0
15:12:54.945888 _llseek(7, 0, , SEEK_END) = 0
15:12:54.945954 ftruncate64(7, 400008) = 0
15:12:54.946010 _llseek(7, 0, , SEEK_END) = 0
If you compare these lines to the ones in the serial code, you can see that they are very similar. Setting the extra "junk" in the output aside, let's look at the IO performance.
The write function call writes the same amount of data, 400,008 bytes. The amount of time to write the data is,
54.945790 - 54.943471 = 0.002319 seconds (2319 microseconds)

So the IO rate of the write function is,
400,008 bytes / 0.002319 secs. = 1.7249x10^8 bytes/second

This works out to be 172.49 MB/s. A bit faster than the serial code, but again, I think there are some caching effects.
I won't examine the other 3 strace.out.* files since it's fairly straightforward to compute the write performance for each of them. But we've only computed the IO performance for a single write call. Imagine if you have a number of write and read calls in a single code. Then you have to perform the computation for every one of those calls.
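With many write calls, doing this arithmetic by hand gets old quickly, so it's worth scripting. The sketch below estimates each write's bandwidth from an strace -tt log by taking the gap between the write's timestamp and the next call's timestamp (the two sample lines are the ones analyzed above; in practice you would feed in a whole strace.out file). Note the assumption: the next line's timestamp is treated as the end of the write, which is only approximately true.

```shell
#!/bin/bash
# Estimate per-write() bandwidth from an strace -tt log.
awk '
{
    split($1, t, ":")                  # HH:MM:SS.micro -> seconds
    now = t[1]*3600 + t[2]*60 + t[3]
    if (pending) {                     # close out the previous write()
        printf "%d bytes in %.6f s = %.2f MB/s\n",
               bytes, now - start, bytes / (now - start) / 1e6
        pending = 0
    }
    if ($2 ~ /^write\(/) {             # remember this write() call
        bytes = $NF                    # return value = bytes written
        start = now
        pending = 1
    }
}' <<'EOF'
15:12:54.943471 write(7, "...", 400008) = 400008
15:12:54.945790 ftruncate64(7, 400008) = 0
EOF
```

For the two sample lines it reports 400008 bytes in 0.002319 s = 172.49 MB/s, matching the hand calculation above.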