Using Strace to Examine Your IO

A special article for IO Knuckleheads

In my experience there are quite a few people who don't really know the IO performance of their application(s) nor the IO pattern. So rather than complain about this situation, as I'm apt to do, I decided to write a column that talks about how to capture the IO behavior of your code(s). These steps can be done for codes where you only have the binary, not necessarily the source. It's fairly easy to capture the information. Processing it is another story, however.

This article will give you an introduction in how to use strace, a commonly used debugging tool for system administrators, as a tool to gather information on the IO pattern of your code(s). I'll show you how to use and gather the information, including how you use it for MPI codes. Read on if you want to learn more about how to use this simple tool to make yourself a better cluster, well..., monkey.

`strace` - The Friend You Never Knew You Had

strace is one of those tools that system administrators use to track down problems. Here's an article that is an example of how people use strace to debug or troubleshoot problems in *nix operating systems. But as we will see, strace is one of those tools that is multi-purpose.

Before we go too much further, let's look at the man pages for strace. I'm looking at the man pages on my CentOS system but I won't be reproducing the entire man page here. I'll leave that up to you. But I will extract certain sections that I think are relevant. Normally I hate reading man pages, but in this case, I think it's important to get an idea of what strace can do.

NAME
   strace - trace system calls and signals

SYNOPSIS
   strace [ -dffhiqrtttTvxx ] [ -acolumn ] [ -eexpr ] ...  [ -ofile ] [ -ppid ] 
          ...  [ -sstrsize ] [ -uusername ] [ -Evar=val ] ...  [ -Evar ] ...  
          [ command [ arg ...  ] ]

   strace -c [ -eexpr ] ...  [ -Ooverhead ] [ -Ssortby ] [ command [ arg ...  ] ]

DESCRIPTION
   In the simplest case strace runs the specified command until it  exits.   It
   intercepts  and  records  the system calls which are called by a process and
   the signals which are received by  a process.  The name of each system call,   
   its arguments and its return value are  printed  on standard error or to the  
   file specified with the -o option.

   Each line in the trace contains the system call name, followed by its argum-
   ents in  parentheses and its return value.  

OPTIONS

   -t          Prefix each line of the trace with the time of day.
   -tt         If given twice, the time printed will include the microseconds.

   -o filename Write the trace output to the file filename rather than  to
               stderr.   Use filename.pid  if -ff is used.  If the argument
               begins with ?|? or with ?!? then the rest of the argument is 
               treated as a command and  all  output  is piped to it.  This 
               is convenient for piping the debugging output to a program 
               without affecting the redirections of executed programs.

                                    2003-01-21                      STRACE(1)

OK, this is a lot to swallow, but I've listed the options I think are worthwhile to achieve our goals. If you look at the above synopsis and compare it to the entire man page, you'll see there are lots of other options that I haven't included. If you are interested in them, please look at the man pages.

The first option of interest,-tt, is the timing of the system call in microseconds. The reason that I want to use this option is that it will tell us at what time each system call happens, in particular IO functions such as read, write, lseek, etc. We can use this information to compute how much time each function call takes which we can turn into IO rates (MB/s). The second option is the name of the output file. This option, -o allows you to dump all of the strace output to a file. (A screen dump can get rather long!)

When you run a code the first thing you type is the name of the binary itself. If you want to get a directory listing you can type something like, ls -s. When you use strace you run the strace command first, followed but the binary you want to examine. The binary is actually started by strace. So the simple ls -s example looks like, strace ls -s.

I've made a couple of assumptions in only using these two options. The first one is that the application does not spawn any tasks that do IO. Also, there are no threads beside thread 0 that perform any IO. I hope you can live within these assumptions. If you can't, you can still get the strace output, but it's a bit harder to process.

A Simple Serial Example

I'm sure there are a number of people who already know how to use strace to examine the IO of a code, but for those who don't know how, let's walk through a simple example. I'm going to use Fortran 90 to write a simple example. My apologies to the other programmers, but it's the language I've chosen (I'll give a C example later). Who knows, maybe we can turn you into a Fortran coder yet.

The code below is a simple Fortran90 code that just creates an array of real numbers and writes it to a unformatted (binary) file. Pretty simple so far - huh?

PROGRAM SERIAL
   IMPLICIT NONE

   INTEGER :: I
   INTEGER :: BUFSIZE
   REAL :: BUF(100000)
! ----------------------------------------------------------------------
   BUFSIZE = 100000
!
   DO I=1,BUFSIZE
      BUF(I)= 2.5 * BUFSIZE + I
   ENDDO
!
   OPEN(7,FILE="testfile",FORM="UNFORMATTED")
   WRITE(7) BUF(1:BUFSIZE)
   CLOSE(7)
!
END PROGRAM SERIAL

I compiled the program with g95.

g95 -O2 serial.f90 -o serial

I then ran the binary, serial on my system but I used strace to capture the function calls from the binary to a file called serial.strace.

strace -tt -o serial.strace ./serial

Now, let's look at the output (warning - it's quite long).

19:54:31.807138 execve("./serial", ["./serial"], [/* 39 vars */]) = 0
19:54:31.807367 uname({sys="Linux", node="home8", ...}) = 0
19:54:31.807567 brk(0)                  = 0x80a0000
19:54:31.807653 access("/etc/ld.so.preload", R_OK) = -1 ENOENT 
(No such file or directory)
19:54:31.807752 open("tls/i686/libm.so.6", O_RDONLY) = -1 ENOENT 
(No such file or directory)
19:54:31.807814 open("tls/libm.so.6", O_RDONLY) = -1 ENOENT 
(No such file or directory)
19:54:31.807873 open("i686/libm.so.6", O_RDONLY) = -1 ENOENT 
(No such file or directory)
19:54:31.807931 open("libm.so.6", O_RDONLY) = -1 ENOENT 
(No such file or directory)
19:54:31.807988 open("/home/laytonj/bin/mpich2-1.0.5p4-g95-old/lib/tls/
i686/libm.so.6", O_RDONLY) = -1 ENOENT (
No such file or directory)
19:54:31.808066 stat64("/home/laytonj/bin/mpich2-1.0.5p4-g95-old/lib/
tls/i686", 0xbfa589bc) = -1 ENOENT (No such file or directory)
19:54:31.808136 open("/home/laytonj/bin/mpich2-1.0.5p4-g95-old/lib/tls/libm.so.6", 
O_RDONLY) = -1 ENOENT (No su
ch file or directory)
19:54:31.808207 stat64("/home/laytonj/bin/mpich2-1.0.5p4-g95-old/lib/tls", 0xbfa589bc) = -1 ENOENT (No such fil
e or directory)
19:54:31.808276 open("/home/laytonj/bin/mpich2-1.0.5p4-g95-old/lib/i686/libm.so.6", O_RDONLY) = -1 ENOENT (No s
uch file or directory)
19:54:31.808347 stat64("/home/laytonj/bin/mpich2-1.0.5p4-g95-old/lib/i686", 0xbfa589bc) = -1 ENOENT (No such fi
le or directory)
19:54:31.808415 open("/home/laytonj/bin/mpich2-1.0.5p4-g95-old/lib/libm.so.6", O_RDONLY) = -1 ENOENT (No such f
ile or directory)
19:54:31.808485 stat64("/home/laytonj/bin/mpich2-1.0.5p4-g95-old/lib", {st_mode=S_IFDIR|0775, st_size=4096, ...
}) = 0
19:54:31.808593 open("/etc/ld.so.cache", O_RDONLY) = 3
19:54:31.808655 fstat64(3, {st_mode=S_IFREG|0644, st_size=87369, ...}) = 0
19:54:31.808734 old_mmap(NULL, 87369, PROT_READ, MAP_PRIVATE, 3, 0) = 0xb7f44000
19:54:31.808808 close(3)                = 0
19:54:31.808874 open("/lib/tls/libm.so.6", O_RDONLY) = 3
19:54:31.808945 read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\0ce\000"..., 512) = 512
19:54:31.809029 fstat64(3, {st_mode=S_IFREG|0755, st_size=138684, ...}) = 0
19:54:31.809110 old_mmap(0x653000, 139424, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x653000
19:54:31.809180 old_mmap(0x674000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x20000)
 = 0x674000
19:54:31.809270 close(3)                = 0
19:54:31.809323 open("tls/i686/libc.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
19:54:31.809382 open("tls/libc.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
19:54:31.809440 open("i686/libc.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
19:54:31.809497 open("libc.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
19:54:31.809553 open("/home/laytonj/bin/mpich2-1.0.5p4-g95-old/lib/libc.so.6", O_RDONLY) = -1 ENOENT (No such f
ile or directory)
19:54:31.809631 open("/lib/tls/libc.so.6", O_RDONLY) = 3
19:54:31.809696 read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\20_\27"..., 512) = 512
19:54:31.809768 fstat64(3, {st_mode=S_IFREG|0755, st_size=1212936, ...}) = 0
19:54:31.809846 old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7f43000
19:54:31.809922 old_mmap(0x161000, 1215644, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x161000
19:54:31.810016 old_mmap(0x284000, 16384, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x12300
0) = 0x284000
19:54:31.810107 old_mmap(0x288000, 7324, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x
288000
19:54:31.810181 close(3)                = 0
19:54:31.810254 old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7f42000
19:54:31.810601 mprotect(0x284000, 4096, PROT_READ) = 0
19:54:31.810693 mprotect(0x15d000, 4096, PROT_READ) = 0
19:54:31.810748 set_thread_area({entry_number:-1 -> 6, base_addr:0xb7f426c0, limit:1048575, seg_32bit:1, conten
ts:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}) = 0
19:54:31.810811 munmap(0xb7f44000, 87369) = 0
19:54:31.810997 brk(0)                  = 0x80a0000
19:54:31.811077 brk(0x80a1000)          = 0x80a1000
19:54:31.811137 brk(0x80a5000)          = 0x80a5000
19:54:31.811205 ioctl(0, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 opost isig icanon echo ...}) = 0
19:54:31.811302 brk(0x80a9000)          = 0x80a9000
19:54:31.811366 ioctl(1, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 opost isig icanon echo ...}) = 0
19:54:31.811432 brk(0x80ad000)          = 0x80ad000
19:54:31.811494 ioctl(2, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 opost isig icanon echo ...}) = 0
19:54:31.811573 rt_sigaction(SIGFPE, {0x8053654, [FPE], SA_SIGINFO}, NULL, 8) = 0
19:54:31.811645 rt_sigaction(SIGHUP, {SIG_DFL}, NULL, 8) = 0
19:54:31.811700 rt_sigaction(SIGINT, {SIG_DFL}, NULL, 8) = 0
19:54:31.811755 rt_sigaction(SIGQUIT, {0x806641c, [], 0}, NULL, 8) = 0
19:54:31.811813 rt_sigaction(SIGALRM, {0x806641c, [], 0}, NULL, 8) = 0
19:54:31.811883 mmap2(NULL, 401408, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7ee0000
19:54:31.813013 rt_sigprocmask(SIG_BLOCK, [QUIT ALRM], NULL, 8) = 0
19:54:31.813178 access("testfile", F_OK) = -1 ENOENT (No such file or directory)
19:54:31.813251 access(".", R_OK)       = 0
19:54:31.813305 access(".", W_OK)       = 0
19:54:31.813374 stat64("testfile", 0xbfa57190) = -1 ENOENT (No such file or directory)
19:54:31.813439 open("testfile", O_RDWR|O_CREAT|O_LARGEFILE, 0666) = 3
19:54:31.813541 brk(0x80b1000)          = 0x80b1000
19:54:31.813613 ioctl(3, SNDCTL_TMR_TIMEBASE or TCGETS, 0xbfa58118) = -1 ENOTTY (Inappropriate ioctl for device
)
19:54:31.813688 rt_sigprocmask(SIG_UNBLOCK, [QUIT ALRM], NULL, 8) = 0
19:54:31.813747 rt_sigprocmask(SIG_BLOCK, [QUIT ALRM], NULL, 8) = 0
19:54:31.813900 brk(0x80b2000)          = 0x80b2000
19:54:31.814157 brk(0x80b4000)          = 0x80b4000
19:54:31.814392 brk(0x80b8000)          = 0x80b8000
19:54:31.814782 brk(0x80c0000)          = 0x80c0000
19:54:31.815543 brk(0x80cf000)          = 0x80cf000
19:54:31.816942 brk(0x80ef000)          = 0x80ef000
19:54:31.825402 mmap2(NULL, 258048, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7ea1000
19:54:31.831774 mmap2(NULL, 516096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7e23000
19:54:31.832821 munmap(0xb7ea1000, 258048) = 0
19:54:31.840115 mmap2(NULL, 401408, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7dc1000
19:54:31.841593 write(3, "\200\32\6\0@$tH\200$tH\300$tH\0%tH@%tH\200%tH\300%tH"..., 400008) = 400008
19:54:31.843949 ftruncate64(3, 400008)  = 0
19:54:31.844049 _llseek(3, 0, [400008], SEEK_END) = 0
19:54:31.844112 ftruncate64(3, 400008)  = 0
19:54:31.844165 _llseek(3, 0, [400008], SEEK_END) = 0
19:54:31.844220 rt_sigprocmask(SIG_UNBLOCK, [QUIT ALRM], NULL, 8) = 0
19:54:31.844285 rt_sigprocmask(SIG_BLOCK, [QUIT ALRM], NULL, 8) = 0
19:54:31.844344 munmap(0xb7dc1000, 401408) = 0
19:54:31.844444 close(3)                = 0
19:54:31.844499 munmap(0xb7e23000, 516096) = 0
19:54:31.844593 brk(0x80ad000)          = 0x80ad000
19:54:31.844670 rt_sigprocmask(SIG_UNBLOCK, [QUIT ALRM], NULL, 8) = 0
19:54:31.844727 munmap(0xb7ee0000, 401408) = 0
19:54:31.844846 exit_group(0)           = ?

Now we're getting to the good stuff! We can ignore most of the stuff in the output since we're really interested in the IO part of it. So I've extracted the relevant IO portion(s) of the output below:

19:54:31.813178 access("testfile", F_OK) = -1 ENOENT (No such file or directory)
19:54:31.813251 access(".", R_OK)       = 0
19:54:31.813305 access(".", W_OK)       = 0
19:54:31.813374 stat64("testfile", 0xbfa57190) = -1 ENOENT (No such file or directory)
19:54:31.813439 open("testfile", O_RDWR|O_CREAT|O_LARGEFILE, 0666) = 3
19:54:31.813541 brk(0x80b1000)          = 0x80b1000
19:54:31.813613 ioctl(3, SNDCTL_TMR_TIMEBASE or TCGETS, 0xbfa58118) = -1 ENOTTY (Inappropriate ioctl for device
)
19:54:31.813688 rt_sigprocmask(SIG_UNBLOCK, [QUIT ALRM], NULL, 8) = 0
19:54:31.813747 rt_sigprocmask(SIG_BLOCK, [QUIT ALRM], NULL, 8) = 0
19:54:31.813900 brk(0x80b2000)          = 0x80b2000
19:54:31.814157 brk(0x80b4000)          = 0x80b4000
19:54:31.814392 brk(0x80b8000)          = 0x80b8000
19:54:31.814782 brk(0x80c0000)          = 0x80c0000
19:54:31.815543 brk(0x80cf000)          = 0x80cf000
19:54:31.816942 brk(0x80ef000)          = 0x80ef000
19:54:31.825402 mmap2(NULL, 258048, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7ea1000
19:54:31.831774 mmap2(NULL, 516096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7e23000
19:54:31.832821 munmap(0xb7ea1000, 258048) = 0
19:54:31.840115 mmap2(NULL, 401408, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7dc1000
19:54:31.841593 write(3, "\200\32\6\0@$tH\200$tH\300$tH\0%tH@%tH\200%tH\300%tH"..., 400008) = 400008
19:54:31.843949 ftruncate64(3, 400008)  = 0
19:54:31.844049 _llseek(3, 0, [400008], SEEK_END) = 0
19:54:31.844112 ftruncate64(3, 400008)  = 0
19:54:31.844165 _llseek(3, 0, [400008], SEEK_END) = 0
19:54:31.844220 rt_sigprocmask(SIG_UNBLOCK, [QUIT ALRM], NULL, 8) = 0
19:54:31.844285 rt_sigprocmask(SIG_BLOCK, [QUIT ALRM], NULL, 8) = 0
19:54:31.844344 munmap(0xb7dc1000, 401408) = 0
19:54:31.844444 close(3)                = 0

There is still some junk in here that we aren't concerned about. I've left them in the above output to help guide through the good bits which I have marked in bold. Let's walk through them a step at a time.

The first five bold lines,

19:54:31.813178 access("testfile", F_OK) = -1 ENOENT (No such file or directory)
19:54:31.813251 access(".", R_OK)       = 0
19:54:31.813305 access(".", W_OK)       = 0
19:54:31.813374 stat64("testfile", 0xbfa57190) = -1 ENOENT (No such file or directory)
19:54:31.813439 open("testfile", O_RDWR|O_CREAT|O_LARGEFILE, 0666) = 3

simply opens the output file, testfile. The first line is checking if the file exists (it doesn't - hence the message "No such file or directory"). The second line checks the directory for read access (it can) and the third line checks for write access (it can). The fourth line returns a status of the file (primarily to see if it exists - and it doesn't). The file is finally opened on the fifth line,

open("testfile", O_RDWR|O_CREAT|O_LARGEFILE, 0666) = 3

Then the code does some other things, not pertaining to IO, until the line,

19:54:31.841593 write(3, "\200\32\6\0@$tH\200$tH\300$tH\0%tH@%tH\200%tH\300%tH"..., 400008) = 400008

You can find out what the function write does by looking in the man pages, but you may have to use the command man 2 write to get the correct man pages. The function write writes data to a file descriptor (basically a file). You can see the data that is being written inside the quotations. For this example, the data is binary so the data doesn't many any sense (unless you are real geek of course). But if the data is text or non-binary data, then you might be able to read the data. strace does not list all of the data being written to the file. The very last number, after the equal sign, is the number of bytes actually written to the file descriptor. In this case, it's 400,008 bytes.

The write function call also contains more useful information. The time that the function begins is printed before the function. I used the -tt option which gives the time including microseconds. If we look at the time for the next function (the next line in the output), we can take the difference between the times to get the elapsed time to perform the previous function. For this example, it is the amount of time it took to perform the write function. Usually it's good enough to look at the seconds and microseconds to determine the amount of time for the write or read function. In this case the elapsed time to perform the write function is,

31.843949 - 31.841593 = 0.0023560 seconds (2356.0 microseconds)

Now we know that the amount of data is 400,008 bytes and the time is 0.0023560 seconds. So the IO rate is,

400,008 bytes / 0.0023560 seconds = 1.697x10^8 bytes/sec

This is the same as 169.78 MB/s assuming that a MB is 1,000,000 bytes. Now you can go through and compute the IO rate for all read and write functions in your strace! (Doesn't this cry out for automation?).

As a side note, the IO rate is pretty high. I was writing to a simple software RAID-1 with two SATA drives and formatted with ext3. The high IO performance is caused by the caching of the file system and possibly even the drive.

MPI Example

What do you do when you strace an MPI code? Ideally you want a file for each MPI process. How do you use strace? in conjunction with mpirun or mpiexec to produce one file per MPI process? One technique I use is to write a main script for the mpirun or mpiexec command. This main script then calls a second script that actually runs the code. In this second script I'll actually put the strace command and the binary. There are a few gotchas that I'll point out in the script. Readers may also be interested two debugging MPI articles by fellow Cluster Monkey Jeff Squyres: MPI: Debugging -- Can You Hear Me Now? and MPI: Debugging in Parallel (in Parallel).

Let's start with a simple example from the MPI-2 book by Bill Gropp, et. at. In Chapter 2 the authors present a simple example of an MPI code where each process of N processes writes data to an individual file (this is usually referred to as N-N IO). I modified the code to write more data than originally presented.

/* example of parallel Unix write into separate files */ 
#include "mpi.h" 
#include  
#define BUFSIZE 100000 
 
int main(int argc, char *argv[]) 
{ 
    int i, myrank, buf[BUFSIZE]; 
    char filename[128]; 
    FILE *myfile; 
 
    MPI_Init(&argc, &argv); 
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank); 
    for (i=0; i < BUFSIZE; i++) 
	buf[i] = myrank * BUFSIZE + i; 
    sprintf(filename, "testfile.%d", myrank); 
    myfile = fopen(filename, "w"); 
    fwrite(buf, sizeof(int), BUFSIZE, myfile); 
    fclose(myfile); 
    MPI_Finalize(); 
    return 0; 
}

Being the versatile cluster geek that I am, I re-wrote the code in Fortran for us older folks.

PROGRAM SEQIO
   IMPLICIT NONE
   INCLUDE 'mpif.h'

   INTEGER :: I
   INTEGER :: MYRANK, NUMPROCS, IERROR
   INTEGER :: BUFSIZE
   REAL :: BUF(100000)
   INTEGER :: ISTATUS(MPI_STATUS_SIZE)
   CHARACTER*10 FILENAME
   CHARACTER*1 RANK1
   CHARACTER*2 RANK2
! -------------------------------------------------------
   BUFSIZE = 100000
   IERROR = 0
   CALL MPI_INIT(IERROR)
   CALL MPI_COMM_RANK(MPI_COMM_WORLD, MYRANK, IERROR)
   CALL MPI_COMM_SIZE(MPI_COMM_WORLD, NUMPROCS, IERROR)
!
   DO I=1,BUFSIZE
      BUF(I)= 2.5 * BUFSIZE + I
   ENDDO
!
   IF (MYRANK < 9 ) THEN
      WRITE(6,11) MYRANK
      WRITE(RANK1,10) MYRANK
      FILENAME = "testfile"//RANK1
10    FORMAT(I1)
11    FORMAT("MYRANK:",I1)
   ELSEIF ((MYRANK >= 9).AND.(MYRANK < 99)) THEN
      WRITE(6,11) MYRANK
      WRITE(RANK2,20) MYRANK
      FILENAME = "testfile"//RANK2
20    FORMAT(I2)
   ENDIF
!
   OPEN(7,FILE=FILENAME,FORM="UNFORMATTED")
   WRITE(7) BUF(1:BUFSIZE)
   CLOSE(7)
!
   CALL MPI_FINALIZE(IERROR)
   
END PROGRAM SEQIO

Let's spend a little bit of time writing the scripts we need to run the code and get the strace output. Don't worry if you don't know bash scripting. I'm not an expert by any stretch, and I usually have to ask friends for help. But the scripts are simple, and I will show you the couple of bits of specialized knowledge you need.

I start with a main script that I usually call main.sh that contains all of the setup for the code as well as the command to run the MPI code. For this example, I used MPICH2 for the MPI layer and I used g95 for the Fortran90 compiler and gcc for the C compiler. I won't cover all of the details of how to use MPICH2 since the website covers everything much better than I could. Below is the main script I use.

#!/bin/bash

mpiexec -machinefile ./MACHINEFILE -np 4 /home/laytonj/TESTING/code1.sh

The first line just says to use bash shell to run the script. The last line is the command to run the MPI code. In this case, it's mpiexec. Notice that what mpiexec actually runs is another script code1.sh.

Before I talk about the script code1.sh, I want to mention that it's fairly easy to adapt main.sh to a job scheduler such as SGE, Torque, PBS-Pro, or LSF. I don't have the space to talk about how to write job scripts for these schedulers, but it's fairly straight forward and there is documentation on the web. If you get stuck, you can always ask on the Beowulf mailing list.

Let's take a look at the meat of the scripts, the script code1.sh.

#!/bin/bash

/usr/bin/strace -tt -o /tmp/strace.out.$$ /home/laytonj/TESTING/code1 $@

Similar to the main script, this script starts by using the bash shell. The third line is the meat of the script. The first part of the line,

/usr/bin/strace -tt -o /tmp/strace.out.$$

starts the code in the same way that we started the serial example, by using the command strace. As before I used the options -tt to get microsecond timing, and -o to point to an output file. Here's where we need to think about how to write the script so that each MPI processes writes to a separate output file.

This is the first bit of bash knowledge that we'll use in our scripts. In the script I have specified the strace output file as,

/tmp/strace.out.$$

So the output files will be located in the /tmp directory on each node used in the run. To keep the files separate, I have added $$ to the end of the file name. In bash scripts, this is a special variable that contains the PID (Process ID) of the script. This is just a number that is specific to each MPI process. So now we have separate file names for each MPI process.

The last bit of bash knowledge we need is how to pass command line arguments to our code (if we need them). In this case, we use another predefined bash variable, $@. This allows you to use all of the arguments that were passed to the code1.shscript (arg1, arg2, ...) as arguments to the code itself. To better see how this works, let's look at a simple example to make sure you know how to do pass command line arguments to the code in code1.sh.

There is an IO benchmark called IOR that has a number of arguments you can pass to the code that describe the details of how to run the benchmark. Here's an example,

IOR -r -w -a MPIIO -b 25m -N 4 -s 25 -t 10m -v -o

Don't won't worry about what all of the options mean, but let me point out a couple because they can be important for a job scheduler script. The option -N 4 tells the code to use 4 MPI processes. You can change the value of 4 to correspond to what the scheduler defines. Now how do we pass these arguments to the script that actually runs the code?

Sticking with the IOR example the main.sh script looks like,

#!/bin/bash

mpiexec -machinefile ./MACHINEFILE -np 4 /home/laytonj/TESTING/code1.sh \
  -r -w -a MPIIO -b 25m -N 4 -s 25 -t 10m -v -o

Notice how I've taken the command line arguments and put them in the main.sh script. With the $@ bash predefined variable in the code script, the options are passed to the code. The code script doesn't change at all (except for the name of the binary).

#!/bin/bash

/usr/bin/strace -tt -o /tmp/strace.out.$$ /home/laytonj/TESTING/IOR $@

The only thing that changed was the name of the binary from code1 to IOR. So if you want to change the arguments to a code you have to modify the main script. If your code doesn't have any command line arguments, I would recommend just leaving $@ in the code for future reference.

If you look a little closer at the examples scripts for running IOR, it is interesting to note the differences compared to running it without strace. Normally, we would have a single script to run IOR where the command consists of three parts. In order they are,

mpirun command (command to start MPI code)
binary of code to run
arguments to code

If we want to run strace against the binary, then we make a single change to the main script. We change the name of the binary to a script. Then in this new script we put the strace command and the name of the actual binary that is to be run. So it's fairly simple to take a current script or scheduler job script and modify it to allow for strace.

When the job is finished you have to go to each node used in the run, and copy the files from /tmp back to whatever file system is more permanent than /tmp. You could write all of the strace output files to a central file system, but you run the risk that you could get two PIDs that are the same. The chances of this are fairly small, but I don't like to take that chance.

Now that we know how to run our MPI jobs using strace, let's look through a simple example. I'm running the code that I presented earlier. I'm going to run with 4 MPI processes for this article. After I run the code I get four strace.out files.

strace.out.3821
strace.out.3822
strace.out.3823
strace.out.3824

The PIDs are numbered sequentially because I ran all 4 MPI processes on the same machine. Let's look at one of the strace output files.

If you look at the strace file, you will notice that it is much longer than for the serial case we ran. The reason is that now we're running an MPI code so much of the extra function calls are due to MPI doing it's thing in the background (i.e. behind our code). The first strace output file is listed in Sidebar One at the end of this article. I've extracted a few of the important lines from the output and put them below.

15:12:54.920557 access("testfile1", F_OK) = -1 ENOENT (No such file or directory)
15:12:54.920631 access(".", R_OK)       = 0
15:12:54.920687 access(".", W_OK)       = 0
15:12:54.920748 stat64("testfile1", 0xbfa56800) = -1 ENOENT (No such file or directory)
15:12:54.920816 open("testfile1", O_RDWR|O_CREAT|O_LARGEFILE, 0666) = 7
...
15:12:54.943471 write(7, "\200\32\6\0@$tH\200$tH\300$tH\0%tH@%tH\200%tH\300%tH"..., 400008) = 400008
15:12:54.945790 ftruncate64(7, 400008)  = 0
15:12:54.945888 _llseek(7, 0, [400008], SEEK_END) = 0
15:12:54.945954 ftruncate64(7, 400008)  = 0
15:12:54.946010 _llseek(7, 0, [400008], SEEK_END) = 0

If you compare these lines to the ones in the serial code, you can see that they are very similar. Despite having more "junk" in the output, let's look at the IO performance.

The write function call writes the same amount of data, 400,008 bytes. The amount of time to write the data is,

54.945790 - 54.943471 = 0.002319 seconds (2319 micro-seconds).

So the IO rate of the write function is,

400,008 bytes / 0.002319 secs. = 1.7249x10^8 bytes/second

This works out to be 172.49 MB/s. A bit faster than the serial code, but again, I think there are some caching affects.

I won't examine the other 3 strace.out.* files since it's fairly straight forward to compute the write performance for each of them. But we're only compute the IO performance for a single write call. Imagine if you have a number of write and read calls in a single code. Then you have to perform the computations for a number of write and read calls.

Final Comments

I hope this article has given you a bit of an introduction on how to use strace to examine the IO pattern of your code. Notice that I didn't have to modify the code or rebuild it with a new library. This means you run an strace on an existing binary. So it's possible to examine the output pattern of your own codes or ISV codes for which you don't have source code.

This article is only an introduction. The output from a real code is much more extensive and complicated. Plus, you have to examine the strace output carefully to understand the IO patterns.

For example, you need to examine the strace.out files for IO functions in addition to open(), write(), read(), and close(). You want to examine if the code writes a little bit of data (maybe a few kilobytes), then uses lseek to move to a different part of the file, writes some more data, and so on. This IO pattern can be very difficult for file systems to optimize because they don't know how far the code lseeks so they have to assume that the data needs to be flushed to disk. Consequently the file system is writing a bunch of small files which is difficult for any distributed or parallel file system to do well.

Of course, there are IO patterns that some file systems can recognize and adapt to. For example, the code starts at the end of a file, lseeks backward a few megabytes, reads a few kilobytes, lseeks backwards, reads a few kilobytes, and so on. Some distributed or parallel file systems can recognize this IO pattern and perform some optimizations, such as caching the next few megabytes, to improve performance.

The key message of this article is that you need to examine the IO performance of your codes and you can do it without too much difficulty using strace. You may want to examine the IO pattern for several reasons.

To learn if IO is a performance bottleneck
To understand your IO pattern
To more critically examine distributed or parallel file system offerings

I didn't start this article with a rant about how little people know about the IO pattern of the code. But it's true. I've seen people who have no idea how much improving the IO performance will improve the performance of their code. There are also people who claim they know the IO pattern of their code and swear that they have to have a massive high-speed IO system for vastly improved IO performance. They sometimes even quote you huge IO throughputs that they just have to have. But, in actuality they have no idea what improved IO performance will do for their code performance. On the other hand, there are people who have a true understanding of the IO pattern of their code. I like those people.

I hope this article starts your on your way to better understanding the IO pattern of your code so that you don't become one of the knuckleheads I just described.

Processing the strace output files is not easy. In fact, a Perl or Python script that can take the output and produce something that looks readable or even a histogram of the IO performance would be very useful. I've looked on Google and I haven't found any yet, but I think there are a couple out there somewhere. If you find them or if you write one yourself, please contact me through Cluster Monkey or post it to the beowulf mailing list to help everyone else.

Jeff Layton is having way too much fun writing this article, proving that it's hard to keep a good geek down. When he's not creating havoc in his household, he can be found hanging out at the Fry's coffee shop (never during working hours) and admiring the shiny new CPUs that come in, and cringing when someone buys Microsoft Vista. The fund I started to buy some storage for myself is still open and taking donations.

Sidebar One: Example strace from MPI code

15:12:54.806723 execve("/home/laytonj/TESTING/code1", ["/home/laytonj/TESTING/code1"], [/* 46 vars */]) = 0
15:12:54.835463 uname({sys="Linux", node="home8", ...}) = 0
15:12:54.835909 brk(0)                  = 0x8133000
15:12:54.836061 access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
15:12:54.836275 open("tls/i686/libpthread.so.0", O_RDONLY) = -1 ENOENT (No such file or directory)
15:12:54.836426 open("tls/libpthread.so.0", O_RDONLY) = -1 ENOENT (No such file or directory)
15:12:54.836553 open("i686/libpthread.so.0", O_RDONLY) = -1 ENOENT (No such file or directory)
15:12:54.836681 open("libpthread.so.0", O_RDONLY) = -1 ENOENT (No such file or directory)
15:12:54.836805 open("/home/laytonj/bin/mpich2-1.0.5p4-g95-old/lib/tls/i686/libpthread.so.0", O_RDONLY) = -1 ENOENT (No such file or directory)
15:12:54.837000 stat64("/home/laytonj/bin/mpich2-1.0.5p4-g95-old/lib/tls/i686", 0xbfa5807c) = -1 ENOENT (No such file or directory)
15:12:54.837150 open("/home/laytonj/bin/mpich2-1.0.5p4-g95-old/lib/tls/libpthread.so.0", O_RDONLY) = -1 ENOENT (No such file or directory)
15:12:54.837307 stat64("/home/laytonj/bin/mpich2-1.0.5p4-g95-old/lib/tls", 0xbfa5807c) = -1 ENOENT (No such file or directory)
15:12:54.837454 open("/home/laytonj/bin/mpich2-1.0.5p4-g95-old/lib/i686/libpthread.so.0", O_RDONLY) = -1 ENOENT (No such file or directory)
15:12:54.837611 stat64("/home/laytonj/bin/mpich2-1.0.5p4-g95-old/lib/i686", 0xbfa5807c) = -1 ENOENT (No such file or directory)
15:12:54.837757 open("/home/laytonj/bin/mpich2-1.0.5p4-g95-old/lib/libpthread.so.0", O_RDONLY) = -1 ENOENT (No such file or directory)
15:12:54.837912 stat64("/home/laytonj/bin/mpich2-1.0.5p4-g95-old/lib", {st_mode=S_IFDIR|0775, st_size=4096, ...}) = 0
15:12:54.838136 open("/etc/ld.so.cache", O_RDONLY) = 3
15:12:54.838267 fstat64(3, {st_mode=S_IFREG|0644, st_size=87369, ...}) = 0
15:12:54.838431 old_mmap(NULL, 87369, PROT_READ, MAP_PRIVATE, 3, 0) = 0xb7f44000
15:12:54.838592 close(3)                = 0
15:12:54.838713 open("/lib/tls/libpthread.so.0", O_RDONLY) = 3
15:12:54.838879 read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\20xz\000"..., 512) = 512
15:12:54.839046 fstat64(3, {st_mode=S_IFREG|0755, st_size=61492, ...}) = 0
15:12:54.839213 old_mmap(0x7a3000, 70108, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7a3000
15:12:54.839365 old_mmap(0x7b1000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0xd000) = 0x7b1000
15:12:54.839547 old_mmap(0x7b3000, 4572, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7b3000
15:12:54.839716 close(3)                = 0
15:12:54.839828 open("tls/i686/librt.so.1", O_RDONLY) = -1 ENOENT (No such file or directory)
15:12:54.839961 open("tls/librt.so.1", O_RDONLY) = -1 ENOENT (No such file or directory)
15:12:54.840085 open("i686/librt.so.1", O_RDONLY) = -1 ENOENT (No such file or directory)
15:12:54.840209 open("librt.so.1", O_RDONLY) = -1 ENOENT (No such file or directory)
15:12:54.840329 open("/home/laytonj/bin/mpich2-1.0.5p4-g95-old/lib/librt.so.1", O_RDONLY) = -1 ENOENT (No such file or directory)
15:12:54.840484 open("/lib/tls/librt.so.1", O_RDONLY) = 3
15:12:54.840622 read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\320`\206"..., 512) = 512
15:12:54.840770 fstat64(3, {st_mode=S_IFREG|0755, st_size=36560, ...}) = 0
15:12:54.841065 old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7f43000
15:12:54.841224 old_mmap(0x864000, 81656, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x864000
15:12:54.841381 old_mmap(0x86c000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x7000) = 0x86c000
15:12:54.841557 old_mmap(0x86e000, 40696, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x86e000
15:12:54.841724 close(3)                = 0
15:12:54.841837 open("tls/i686/libm.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
15:12:54.841972 open("tls/libm.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
15:12:54.842095 open("i686/libm.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
15:12:54.842262 open("libm.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
15:12:54.842386 open("/home/laytonj/bin/mpich2-1.0.5p4-g95-old/lib/libm.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
15:12:54.842547 open("/lib/tls/libm.so.6", O_RDONLY) = 3
15:12:54.842799 read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\0ce\000"..., 512) = 512
15:12:54.842955 fstat64(3, {st_mode=S_IFREG|0755, st_size=138684, ...}) = 0
15:12:54.843123 old_mmap(0x653000, 139424, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x653000
15:12:54.843274 old_mmap(0x674000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x20000) = 0x674000
15:12:54.843443 close(3)                = 0
15:12:54.843562 open("tls/i686/libc.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
15:12:54.843698 open("tls/libc.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
15:12:54.843822 open("i686/libc.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
15:12:54.843945 open("libc.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
15:12:54.844067 open("/home/laytonj/bin/mpich2-1.0.5p4-g95-old/lib/libc.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
15:12:54.844228 open("/lib/tls/libc.so.6", O_RDONLY) = 3
15:12:54.844372 read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\20_\27"..., 512) = 512
15:12:54.844519 fstat64(3, {st_mode=S_IFREG|0755, st_size=1212936, ...}) = 0
15:12:54.844688 old_mmap(0x161000, 1215644, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x161000
15:12:54.844841 old_mmap(0x284000, 16384, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x123000) = 0x284000
15:12:54.845871 old_mmap(0x288000, 7324, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x288000
15:12:54.846073 close(3)                = 0
15:12:54.846237 old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7f42000
15:12:54.846647 mprotect(0x284000, 4096, PROT_READ) = 0
15:12:54.846962 mprotect(0x15d000, 4096, PROT_READ) = 0
15:12:54.847139 set_thread_area({entry_number:-1 -> 6, base_addr:0xb7f426c0, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}) = 0
15:12:54.847271 munmap(0xb7f44000, 87369) = 0
15:12:54.847393 set_tid_address(0xb7f42708) = 5220
15:12:54.847515 rt_sigaction(SIGRTMIN, {0x7a7380, [], SA_SIGINFO}, NULL, 8) = 0
15:12:54.847672 rt_sigaction(SIGRT_1, {0x7a73f0, [], SA_RESTART|SA_SIGINFO}, NULL, 8) = 0
15:12:54.847802 rt_sigprocmask(SIG_UNBLOCK, [RTMIN RT_1], NULL, 8) = 0
15:12:54.847928 getrlimit(RLIMIT_STACK, {rlim_cur=8192*1024, rlim_max=RLIM_INFINITY}) = 0
15:12:54.848081 _sysctl({{CTL_KERN, KERN_VERSION}, 2, 0xbfa5880c, 30, (nil), 0}) = 0
15:12:54.848431 brk(0)                  = 0x8133000
15:12:54.848629 brk(0x8134000)          = 0x8134000
15:12:54.848744 brk(0x8138000)          = 0x8138000
15:12:54.848876 ioctl(0, SNDCTL_TMR_TIMEBASE or TCGETS, 0xbfa58938) = -1 EINVAL (Invalid argument)
15:12:54.849084 brk(0x813c000)          = 0x813c000
15:12:54.849221 ioctl(1, SNDCTL_TMR_TIMEBASE or TCGETS, 0xbfa58938) = -1 EINVAL (Invalid argument)
15:12:54.849349 brk(0x8140000)          = 0x8140000
15:12:54.849471 ioctl(2, SNDCTL_TMR_TIMEBASE or TCGETS, 0xbfa58938) = -1 EINVAL (Invalid argument)
15:12:54.849614 rt_sigaction(SIGFPE, {0x809a9b4, [FPE], SA_SIGINFO}, NULL, 8) = 0
15:12:54.849763 rt_sigaction(SIGHUP, {SIG_DFL}, NULL, 8) = 0
15:12:54.849882 rt_sigaction(SIGINT, {SIG_DFL}, NULL, 8) = 0
15:12:54.849999 rt_sigaction(SIGQUIT, {0x80ad77c, [], 0}, NULL, 8) = 0
15:12:54.850124 rt_sigaction(SIGALRM, {0x80ad77c, [], 0}, NULL, 8) = 0
15:12:54.850268 mmap2(NULL, 401408, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7ee0000
15:12:54.850543 gettimeofday({1189365174, 850723}, NULL) = 0
15:12:54.850791 open("/etc/resolv.conf", O_RDONLY) = 3
15:12:54.850954 fstat64(3, {st_mode=S_IFREG|0644, st_size=79, ...}) = 0
15:12:54.851127 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7f59000
15:12:54.851243 read(3, "nameserver 24.197.160.18\nnameser"..., 4096) = 79
15:12:54.851455 read(3, "", 4096)       = 0
15:12:54.851574 close(3)                = 0
15:12:54.851678 munmap(0xb7f59000, 4096) = 0
15:12:54.851795 uname({sys="Linux", node="home8", ...}) = 0
15:12:54.852132 socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 3
15:12:54.852296 setsockopt(3, SOL_TCP, TCP_NODELAY, [1], 4) = 0
15:12:54.852432 connect(3, {sa_family=AF_INET, sin_port=htons(52320), sin_addr=inet_addr("192.168.0.8")}, 16) = 0
15:12:54.853324 write(3, "cmd=init pmi_version=1 pmi_subve"..., 40) = 40
15:12:54.858664 read(3, "cmd=response_to_init pmi_version"..., 1023) = 57
15:12:54.858894 write(3, "cmd=get_maxes\n", 14) = 14
15:12:54.859677 read(3, "cmd=maxes kvsname_max=4096 keyle"..., 1023) = 59
15:12:54.859872 write(3, "cmd=get_appnum\n", 15) = 15
15:12:54.860576 read(3, "cmd=appnum appnum=0\n", 1023) = 20
15:12:54.860759 brk(0x8141000)          = 0x8141000
15:12:54.860895 write(3, "cmd=get_my_kvsname\n", 19) = 19
15:12:54.861660 read(3, "cmd=my_kvsname kvsname=kvs_home8"..., 1023) = 43
15:12:54.861857 brk(0x8142000)          = 0x8142000
15:12:54.861996 write(3, "cmd=get_my_kvsname\n", 19) = 19
15:12:54.862718 read(3, "cmd=my_kvsname kvsname=kvs_home8"..., 1023) = 43
15:12:54.862927 brk(0x8143000)          = 0x8143000
15:12:54.863084 socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 4
15:12:54.863236 fcntl64(4, F_GETFL)     = 0x2 (flags O_RDWR)
15:12:54.863350 fcntl64(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0
15:12:54.863470 bind(4, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
15:12:54.863621 listen(4, 128)          = 0
15:12:54.863775 getsockname(4, {sa_family=AF_INET, sin_port=htons(46099), sin_addr=inet_addr("0.0.0.0")}, [16]) = 0
15:12:54.863975 brk(0x8144000)          = 0x8144000
15:12:54.864101 brk(0x8145000)          = 0x8145000
15:12:54.864265 write(3, "cmd=put kvsname=kvs_home8_39119_"..., 117) = 117
15:12:54.866026 read(3, "cmd=put_result rc=0\n", 1023) = 20
15:12:54.866212 write(3, "cmd=barrier_in\n", 15) = 15
15:12:54.867231 read(3, "cmd=barrier_out\n", 1023) = 16
15:12:54.920090 rt_sigprocmask(SIG_BLOCK, [QUIT ALRM], NULL, 8) = 0
15:12:54.920283 rt_sigprocmask(SIG_UNBLOCK, [QUIT ALRM], NULL, 8) = 0
15:12:54.920349 rt_sigprocmask(SIG_BLOCK, [QUIT ALRM], NULL, 8) = 0
15:12:54.920416 rt_sigprocmask(SIG_UNBLOCK, [QUIT ALRM], NULL, 8) = 0
15:12:54.920483 rt_sigprocmask(SIG_BLOCK, [QUIT ALRM], NULL, 8) = 0
15:12:54.920557 access("testfile1", F_OK) = -1 ENOENT (No such file or directory)
15:12:54.920631 access(".", R_OK)       = 0
15:12:54.920687 access(".", W_OK)       = 0
15:12:54.920748 stat64("testfile1", 0xbfa56800) = -1 ENOENT (No such file or directory)
15:12:54.920816 open("testfile1", O_RDWR|O_CREAT|O_LARGEFILE, 0666) = 7
15:12:54.920956 brk(0x8149000)          = 0x8149000
15:12:54.921032 ioctl(7, SNDCTL_TMR_TIMEBASE or TCGETS, 0xbfa57788) = -1 ENOTTY (Inappropriate ioctl for device)
15:12:54.921102 rt_sigprocmask(SIG_UNBLOCK, [QUIT ALRM], NULL, 8) = 0
15:12:54.921164 rt_sigprocmask(SIG_BLOCK, [QUIT ALRM], NULL, 8) = 0
15:12:54.921308 brk(0x814a000)          = 0x814a000
15:12:54.921448 brk(0x814c000)          = 0x814c000
15:12:54.921675 brk(0x8150000)          = 0x8150000
15:12:54.922074 brk(0x8158000)          = 0x8158000
15:12:54.922918 brk(0x8167000)          = 0x8167000
15:12:54.924345 brk(0x8187000)          = 0x8187000
15:12:54.927224 mmap2(NULL, 258048, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7ea1000
15:12:54.933745 mmap2(NULL, 516096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7e23000
15:12:54.934780 munmap(0xb7ea1000, 258048) = 0
15:12:54.940880 mmap2(NULL, 401408, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7dc1000
15:12:54.943471 write(7, "\200\32\6\0@$tH\200$tH\300$tH\0%tH@%tH\200%tH\300%tH"..., 400008) = 400008
15:12:54.945790 ftruncate64(7, 400008)  = 0
15:12:54.945888 _llseek(7, 0, [400008], SEEK_END) = 0
15:12:54.945954 ftruncate64(7, 400008)  = 0
15:12:54.946010 _llseek(7, 0, [400008], SEEK_END) = 0
15:12:54.946069 rt_sigprocmask(SIG_UNBLOCK, [QUIT ALRM], NULL, 8) = 0
15:12:54.946159 rt_sigprocmask(SIG_BLOCK, [QUIT ALRM], NULL, 8) = 0
15:12:54.946223 munmap(0xb7dc1000, 401408) = 0
15:12:54.946316 close(7)                = 0
15:12:54.946374 munmap(0xb7e23000, 516096) = 0
15:12:54.946460 brk(0x8145000)          = 0x8145000
15:12:54.946540 rt_sigprocmask(SIG_UNBLOCK, [QUIT ALRM], NULL, 8) = 0
15:12:54.946641 write(3, "cmd=get kvsname=kvs_home8_39119_"..., 56) = 56
15:12:54.948536 read(3, "cmd=get_result rc=0 value=port#4"..., 1023) = 81
15:12:54.948696 brk(0x8146000)          = 0x8146000
15:12:54.948774 socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 7
15:12:54.948852 fcntl64(7, F_GETFL)     = 0x2 (flags O_RDWR)
15:12:54.948910 fcntl64(7, F_SETFL, O_RDWR|O_NONBLOCK) = 0
15:12:54.948994 setsockopt(7, SOL_TCP, TCP_NODELAY, [1], 4) = 0
15:12:54.949061 connect(7, {sa_family=AF_INET, sin_port=htons(49392), sin_addr=inet_addr("192.168.0.8")}, 16) = -1 EINPROGRESS (Operation now in progress)
15:12:54.949235 poll([{fd=4, events=POLLIN, revents=POLLIN}, {fd=7, events=POLLOUT, revents=POLLOUT}], 2, -1) = 2
15:12:54.949323 getpeername(7, {sa_family=AF_INET, sin_port=htons(49392), sin_addr=inet_addr("192.168.0.8")}, [16]) = 0
15:12:54.949526 brk(0x8147000)          = 0x8147000
15:12:54.949596 accept(4, {sa_family=AF_INET, sin_port=htons(43533), sin_addr=inet_addr("192.168.0.8")}, [16]) = 8
15:12:54.949699 fcntl64(8, F_GETFL)     = 0x2 (flags O_RDWR)
15:12:54.949752 fcntl64(8, F_SETFL, O_RDWR|O_NONBLOCK) = 0
15:12:54.949808 setsockopt(8, SOL_TCP, TCP_NODELAY, [1], 4) = 0
15:12:54.949871 poll([{fd=4, events=POLLIN}, {fd=7, events=POLLOUT, revents=POLLOUT}, {fd=8, events=POLLIN, revents=POLLIN}], 3, -1) = 2
15:12:54.949946 writev(7, [{"\27\0\0\0\24\0\0\0\1\0\0\0\0\0\0\0\210\32\6\0\0\0\0\0\210"..., 32}, {"kvs_home8_39119_5_0\0", 20}], 2) = 52
15:12:54.950066 read(8, "\27\0\0\0\24\0\0\0\0\0\0\0\0\0\0\0\210\32\6\0\0\0\0\0\210"..., 32) = 32
15:12:54.950149 poll([{fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN, revents=POLLIN}], 3, -1) = 1
15:12:54.950217 read(8, "kvs_home8_39119_5_0\0", 20) = 20
15:12:54.950283 poll([{fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLOUT, revents=POLLOUT}], 3, -1) = 1
15:12:54.950350 write(8, "\31\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\210\32\6\0\0\0\0\0\210"..., 32) = 32
15:12:54.950766 poll([{fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN, revents=POLLIN}], 3, -1) = 1
15:12:54.950834 read(8, "\0\0\0\0\1\0\0\0\0\0\1\0\1\0\0\254\0\0\0\0\1\0\0\0\1\0"..., 32) = 32
15:12:54.950916 poll([{fd=4, events=POLLIN, revents=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}], 3, -1) = 1
15:12:55.026209 brk(0x8148000)          = 0x8148000
15:12:55.026365 accept(4, {sa_family=AF_INET, sin_port=htons(43538), sin_addr=inet_addr("192.168.0.8")}, [16]) = 10
15:12:55.026576 fcntl64(10, F_GETFL)    = 0x2 (flags O_RDWR)
15:12:55.026771 fcntl64(10, F_SETFL, O_RDWR|O_NONBLOCK) = 0
15:12:55.026915 setsockopt(10, SOL_TCP, TCP_NODELAY, [1], 4) = 0
15:12:55.026984 poll([{fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=10, events=POLLIN, revents=POLLIN}], 4, -1) = 1
15:12:55.027057 read(10, "\27\0\0\0\24\0\0\0\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 32) = 32
15:12:55.027138 poll([{fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=10, events=POLLIN, revents=POLLIN}], 4, -1) = 1
15:12:55.027209 read(10, "kvs_home8_39119_5_0\0", 20) = 20
15:12:55.027276 poll([{fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=10, events=POLLOUT, revents=POLLOUT}], 4, -1) = 1
15:12:55.027347 write(10, "\31\0\0\0\1\0\0\0\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 32) = 32
15:12:55.027476 poll([{fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=10, events=POLLIN, revents=POLLIN}], 4, -1) = 1
15:12:55.027824 read(10, "\0\0\0\0\1\0\0\0\3\0\1\0\1\0\0\254\0\0\0\0\1\0\0\0\1\0"..., 32) = 32
15:12:55.027909 poll([{fd=4, events=POLLIN}, {fd=7, events=POLLIN, revents=POLLIN}, {fd=8, events=POLLIN}, {fd=10, events=POLLIN}], 4, -1) = 1
15:12:55.028100 read(7, "\31\0\0\0\1\0\0\0\1\0\0\0\0\0\0\0\210\32\6\0\0\0\0\0\210"..., 32) = 32
15:12:55.028202 poll([{fd=4, events=POLLIN}, {fd=7, events=POLLIN|POLLOUT, revents=POLLOUT}, {fd=8, events=POLLIN}, {fd=10, events=POLLIN}], 4, -1) = 1
15:12:55.028276 writev(7, [{"\0\0\0\0\1\0\0\0\1\0\1\0\1\0\0\254\0\0\0\0\1\0\0\0\1\0"..., 32}], 1) = 32
15:12:55.028378 write(10, "\0\0\0\0\1\0\0\0\1\0\1\0\0\0\0\254\0\0\0\0\1\0\0\0\1\0"..., 32) = 32
15:12:55.028515 write(8, "\25\0\0\0\0\0\0\0\0008\24\10\360\254\17\10$\213\245\277"..., 32) = 32
15:12:55.028750 write(7, "\25\0\0\0\0\0\0\0\0008\24\10\360\254\17\10$\213\245\277"..., 32) = 32
15:12:55.029021 write(10, "\25\0\0\0\0\0\0\0\0008\24\10\360\254\17\10$\213\245\277"..., 32) = 32
15:12:55.029188 poll([{fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=10, events=POLLIN, revents=POLLIN}], 4, -1) = 1
15:12:55.029449 read(10, "\25\0\0\0\0\0\0\0\0008\24\10\360\254\17\10t\223\354\277"..., 32) = 32
15:12:55.029603 write(10, "\25\0\0\0\1\0\0\0\315\343!\0\1\0\0\0\0\0\0\0\0\0\0\0hP"..., 32) = 32
15:12:55.029763 poll([{fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=10, events=POLLIN, revents=POLLIN}], 4, -1) = 1
15:12:55.029955 read(10, "\25\0\0\0\1\0\0\0\315\343!\0\1\0\0\0\0\0\0\0\0\0\0\0hP"..., 32) = 32
15:12:55.030105 fcntl64(10, F_GETFL)    = 0x802 (flags O_RDWR|O_NONBLOCK)
15:12:55.030238 fcntl64(10, F_SETFL, O_RDWR) = 0
15:12:55.030343 close(10)               = 0
15:12:55.030473 poll([{fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN, revents=POLLIN}], 3, -1) = 1
15:12:55.032021 read(8, "\25\0\0\0\1\0\0\0\0008\24\10\360\254\17\10dz\260\277\360"..., 32) = 32
15:12:55.032098 write(8, "\25\0\0\0\1\0\0\0\315\343!\0\1\0\0\0\0\0\0\0\0\0\0\0hP"..., 32) = 32
15:12:55.032178 fcntl64(8, F_GETFL)     = 0x802 (flags O_RDWR|O_NONBLOCK)
15:12:55.032233 fcntl64(8, F_SETFL, O_RDWR) = 0
15:12:55.032283 close(8)                = 0
15:12:55.032348 poll([{fd=4, events=POLLIN}, {fd=7, events=POLLIN, revents=POLLIN}], 2, -1) = 1
15:12:55.035104 read(7, "\25\0\0\0\1\0\0\0\0008\24\10\360\254\17\10tw\306\277\0"..., 32) = 32
15:12:55.035187 write(7, "\25\0\0\0\1\0\0\0\315\343!\0\1\0\0\0\0\0\0\0\0\0\0\0hP"..., 32) = 32
15:12:55.035271 fcntl64(7, F_GETFL)     = 0x802 (flags O_RDWR|O_NONBLOCK)
15:12:55.035326 fcntl64(7, F_SETFL, O_RDWR) = 0
15:12:55.035375 close(7)                = 0
15:12:55.035449 fcntl64(4, F_GETFL)     = 0x802 (flags O_RDWR|O_NONBLOCK)
15:12:55.035502 fcntl64(4, F_SETFL, O_RDWR) = 0
15:12:55.035551 close(4)                = 0
15:12:55.035626 write(3, "cmd=finalize\n", 13) = 13
15:12:55.036095 read(3, "cmd=finalize_ack\n", 1023) = 17
15:12:55.036191 shutdown(3, 2 /* send and receive */) = 0
15:12:55.036434 close(3)                = 0
15:12:55.036513 munmap(0xb7ee0000, 401408) = 0
15:12:55.036625 write(1, "MYRANK:1\n", 9) = 9
15:12:55.037358 exit_group(0)           = ?

strace - The Friend You Never Knew You Had

A Simple Serial Example

MPI Example

Final Comments

`strace` - The Friend You Never Knew You Had