A special article for IO Knuckleheads
In my experience there are quite a few people who don't really know the IO performance of their application(s) nor the IO pattern. So rather than complain about this situation, as I'm apt to do, I decided to write a column that talks about how to capture the IO behavior of your code(s). These steps can be done for codes where you only have the binary, not necessarily the source. It's fairly easy to capture the information. Processing it is another story, however.
This article will give you an introduction in how to use strace, a commonly used debugging tool for system administrators, as a tool to gather information on the IO pattern of your code(s). I'll show you how to use and gather the information, including how you use it for MPI codes. Read on if you want to learn more about how to use this simple tool to make yourself a better cluster, well..., monkey.
strace - The Friend You Never Knew You Had
strace is one of those tools that system administrators use to track down problems. Here's an article that is an example of how people use strace to debug or troubleshoot problems in *nix operating systems. But as we will see, strace is one of those tools that is multi-purpose.
Before we go too much further, let's look at the man pages for
strace. I'm looking at the man pages on my
CentOS system but I
won't be reproducing the entire man page here. I'll leave that up to you.
But I will extract certain sections that I think are relevant. Normally
I hate reading man pages, but in this case, I think it's important to
get an idea of what strace can do.
NAME
strace - trace system calls and signals
SYNOPSIS
strace [ -dffhiqrtttTvxx ] [ -acolumn ] [ -eexpr ] ... [ -ofile ] [ -ppid ]
... [ -sstrsize ] [ -uusername ] [ -Evar=val ] ... [ -Evar ] ...
[ command [ arg ... ] ]
strace -c [ -eexpr ] ... [ -Ooverhead ] [ -Ssortby ] [ command [ arg ... ] ]
DESCRIPTION
In the simplest case strace runs the specified command until it exits. It
intercepts and records the system calls which are called by a process and
the signals which are received by a process. The name of each system call,
its arguments and its return value are printed on standard error or to the
file specified with the -o option.
Each line in the trace contains the system call name, followed by its argum-
ents in parentheses and its return value.
OPTIONS
-t Prefix each line of the trace with the time of day.
-tt If given twice, the time printed will include the microseconds.
-o filename Write the trace output to the file filename rather than to
stderr. Use filename.pid if -ff is used. If the argument
begins with ?|? or with ?!? then the rest of the argument is
treated as a command and all output is piped to it. This
is convenient for piping the debugging output to a program
without affecting the redirections of executed programs.
2003-01-21 STRACE(1)
OK, this is a lot to swallow, but I've listed the options I think are worthwhile to achieve our goals. If you look at the above synopsis and compare it to the entire man page, you'll see there are lots of other options that I haven't included. If you are interested in them, please look at the man pages.
The first option of interest,-tt, is the timing of the system call in microseconds. The reason that I want to use this option is that it will tell us at what time each system call happens, in particular IO functions such as read, write, lseek, etc. We can use this information to compute how much time each function call takes which we can turn into IO rates (MB/s). The second option is the name of the output file. This option, -o allows you to dump all of the strace output to a file. (A screen dump can get rather long!)
When you run a code the first thing you type is the name of the binary itself. If you want to get a directory listing you can type something like, ls -s. When you use strace you run the strace command first, followed but the binary you want to examine. The binary is actually started by strace. So the simple ls -s example looks like, strace ls -s.
I've made a couple of assumptions in only using these two options. The first one is that the application does not spawn any tasks that do IO. Also, there are no threads beside thread 0 that perform any IO. I hope you can live within these assumptions. If you can't, you can still get the strace output, but it's a bit harder to process.
A Simple Serial Example
I'm sure there are a number of people who already know how to use strace to examine the IO of a code, but for those who don't know how, let's walk through a simple example. I'm going to use Fortran 90 to write a simple example. My apologies to the other programmers, but it's the language I've chosen (I'll give a C example later). Who knows, maybe we can turn you into a Fortran coder yet.
The code below is a simple Fortran90 code that just creates an array of
real numbers and writes it to a unformatted (binary) file. Pretty simple
so far - huh?
PROGRAM SERIAL
IMPLICIT NONE
INTEGER :: I
INTEGER :: BUFSIZE
REAL :: BUF(100000)
! ----------------------------------------------------------------------
BUFSIZE = 100000
!
DO I=1,BUFSIZE
BUF(I)= 2.5 * BUFSIZE + I
ENDDO
!
OPEN(7,FILE="testfile",FORM="UNFORMATTED")
WRITE(7) BUF(1:BUFSIZE)
CLOSE(7)
!
END PROGRAM SERIAL
I compiled the program with g95.
g95 -O2 serial.f90 -o serialI then ran the binary, serial on my system but I used strace to capture the function calls from the binary to a file called serial.strace.
strace -tt -o serial.strace ./serial
Now, let's look at the output (warning - it's quite long).
19:54:31.807138 execve("./serial", ["./serial"], [/* 39 vars */]) = 0
19:54:31.807367 uname({sys="Linux", node="home8", ...}) = 0
19:54:31.807567 brk(0) = 0x80a0000
19:54:31.807653 access("/etc/ld.so.preload", R_OK) = -1 ENOENT
(No such file or directory)
19:54:31.807752 open("tls/i686/libm.so.6", O_RDONLY) = -1 ENOENT
(No such file or directory)
19:54:31.807814 open("tls/libm.so.6", O_RDONLY) = -1 ENOENT
(No such file or directory)
19:54:31.807873 open("i686/libm.so.6", O_RDONLY) = -1 ENOENT
(No such file or directory)
19:54:31.807931 open("libm.so.6", O_RDONLY) = -1 ENOENT
(No such file or directory)
19:54:31.807988 open("/home/laytonj/bin/mpich2-1.0.5p4-g95-old/lib/tls/
i686/libm.so.6", O_RDONLY) = -1 ENOENT (
No such file or directory)
19:54:31.808066 stat64("/home/laytonj/bin/mpich2-1.0.5p4-g95-old/lib/
tls/i686", 0xbfa589bc) = -1 ENOENT (No such file or directory)
19:54:31.808136 open("/home/laytonj/bin/mpich2-1.0.5p4-g95-old/lib/tls/libm.so.6",
O_RDONLY) = -1 ENOENT (No su
ch file or directory)
19:54:31.808207 stat64("/home/laytonj/bin/mpich2-1.0.5p4-g95-old/lib/tls", 0xbfa589bc) = -1 ENOENT (No such fil
e or directory)
19:54:31.808276 open("/home/laytonj/bin/mpich2-1.0.5p4-g95-old/lib/i686/libm.so.6", O_RDONLY) = -1 ENOENT (No s
uch file or directory)
19:54:31.808347 stat64("/home/laytonj/bin/mpich2-1.0.5p4-g95-old/lib/i686", 0xbfa589bc) = -1 ENOENT (No such fi
le or directory)
19:54:31.808415 open("/home/laytonj/bin/mpich2-1.0.5p4-g95-old/lib/libm.so.6", O_RDONLY) = -1 ENOENT (No such f
ile or directory)
19:54:31.808485 stat64("/home/laytonj/bin/mpich2-1.0.5p4-g95-old/lib", {st_mode=S_IFDIR|0775, st_size=4096, ...
}) = 0
19:54:31.808593 open("/etc/ld.so.cache", O_RDONLY) = 3
19:54:31.808655 fstat64(3, {st_mode=S_IFREG|0644, st_size=87369, ...}) = 0
19:54:31.808734 old_mmap(NULL, 87369, PROT_READ, MAP_PRIVATE, 3, 0) = 0xb7f44000
19:54:31.808808 close(3) = 0
19:54:31.808874 open("/lib/tls/libm.so.6", O_RDONLY) = 3
19:54:31.808945 read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\0ce\000"..., 512) = 512
19:54:31.809029 fstat64(3, {st_mode=S_IFREG|0755, st_size=138684, ...}) = 0
19:54:31.809110 old_mmap(0x653000, 139424, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x653000
19:54:31.809180 old_mmap(0x674000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x20000)
= 0x674000
19:54:31.809270 close(3) = 0
19:54:31.809323 open("tls/i686/libc.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
19:54:31.809382 open("tls/libc.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
19:54:31.809440 open("i686/libc.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
19:54:31.809497 open("libc.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
19:54:31.809553 open("/home/laytonj/bin/mpich2-1.0.5p4-g95-old/lib/libc.so.6", O_RDONLY) = -1 ENOENT (No such f
ile or directory)
19:54:31.809631 open("/lib/tls/libc.so.6", O_RDONLY) = 3
19:54:31.809696 read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\20_\27"..., 512) = 512
19:54:31.809768 fstat64(3, {st_mode=S_IFREG|0755, st_size=1212936, ...}) = 0
19:54:31.809846 old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7f43000
19:54:31.809922 old_mmap(0x161000, 1215644, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x161000
19:54:31.810016 old_mmap(0x284000, 16384, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x12300
0) = 0x284000
19:54:31.810107 old_mmap(0x288000, 7324, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x
288000
19:54:31.810181 close(3) = 0
19:54:31.810254 old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7f42000
19:54:31.810601 mprotect(0x284000, 4096, PROT_READ) = 0
19:54:31.810693 mprotect(0x15d000, 4096, PROT_READ) = 0
19:54:31.810748 set_thread_area({entry_number:-1 -> 6, base_addr:0xb7f426c0, limit:1048575, seg_32bit:1, conten
ts:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}) = 0
19:54:31.810811 munmap(0xb7f44000, 87369) = 0
19:54:31.810997 brk(0) = 0x80a0000
19:54:31.811077 brk(0x80a1000) = 0x80a1000
19:54:31.811137 brk(0x80a5000) = 0x80a5000
19:54:31.811205 ioctl(0, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 opost isig icanon echo ...}) = 0
19:54:31.811302 brk(0x80a9000) = 0x80a9000
19:54:31.811366 ioctl(1, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 opost isig icanon echo ...}) = 0
19:54:31.811432 brk(0x80ad000) = 0x80ad000
19:54:31.811494 ioctl(2, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 opost isig icanon echo ...}) = 0
19:54:31.811573 rt_sigaction(SIGFPE, {0x8053654, [FPE], SA_SIGINFO}, NULL, 8) = 0
19:54:31.811645 rt_sigaction(SIGHUP, {SIG_DFL}, NULL, 8) = 0
19:54:31.811700 rt_sigaction(SIGINT, {SIG_DFL}, NULL, 8) = 0
19:54:31.811755 rt_sigaction(SIGQUIT, {0x806641c, [], 0}, NULL, 8) = 0
19:54:31.811813 rt_sigaction(SIGALRM, {0x806641c, [], 0}, NULL, 8) = 0
19:54:31.811883 mmap2(NULL, 401408, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7ee0000
19:54:31.813013 rt_sigprocmask(SIG_BLOCK, [QUIT ALRM], NULL, 8) = 0
19:54:31.813178 access("testfile", F_OK) = -1 ENOENT (No such file or directory)
19:54:31.813251 access(".", R_OK) = 0
19:54:31.813305 access(".", W_OK) = 0
19:54:31.813374 stat64("testfile", 0xbfa57190) = -1 ENOENT (No such file or directory)
19:54:31.813439 open("testfile", O_RDWR|O_CREAT|O_LARGEFILE, 0666) = 3
19:54:31.813541 brk(0x80b1000) = 0x80b1000
19:54:31.813613 ioctl(3, SNDCTL_TMR_TIMEBASE or TCGETS, 0xbfa58118) = -1 ENOTTY (Inappropriate ioctl for device
)
19:54:31.813688 rt_sigprocmask(SIG_UNBLOCK, [QUIT ALRM], NULL, 8) = 0
19:54:31.813747 rt_sigprocmask(SIG_BLOCK, [QUIT ALRM], NULL, 8) = 0
19:54:31.813900 brk(0x80b2000) = 0x80b2000
19:54:31.814157 brk(0x80b4000) = 0x80b4000
19:54:31.814392 brk(0x80b8000) = 0x80b8000
19:54:31.814782 brk(0x80c0000) = 0x80c0000
19:54:31.815543 brk(0x80cf000) = 0x80cf000
19:54:31.816942 brk(0x80ef000) = 0x80ef000
19:54:31.825402 mmap2(NULL, 258048, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7ea1000
19:54:31.831774 mmap2(NULL, 516096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7e23000
19:54:31.832821 munmap(0xb7ea1000, 258048) = 0
19:54:31.840115 mmap2(NULL, 401408, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7dc1000
19:54:31.841593 write(3, "\200\32\6\0@$tH\200$tH\300$tH\0%tH@%tH\200%tH\300%tH"..., 400008) = 400008
19:54:31.843949 ftruncate64(3, 400008) = 0
19:54:31.844049 _llseek(3, 0, [400008], SEEK_END) = 0
19:54:31.844112 ftruncate64(3, 400008) = 0
19:54:31.844165 _llseek(3, 0, [400008], SEEK_END) = 0
19:54:31.844220 rt_sigprocmask(SIG_UNBLOCK, [QUIT ALRM], NULL, 8) = 0
19:54:31.844285 rt_sigprocmask(SIG_BLOCK, [QUIT ALRM], NULL, 8) = 0
19:54:31.844344 munmap(0xb7dc1000, 401408) = 0
19:54:31.844444 close(3) = 0
19:54:31.844499 munmap(0xb7e23000, 516096) = 0
19:54:31.844593 brk(0x80ad000) = 0x80ad000
19:54:31.844670 rt_sigprocmask(SIG_UNBLOCK, [QUIT ALRM], NULL, 8) = 0
19:54:31.844727 munmap(0xb7ee0000, 401408) = 0
19:54:31.844846 exit_group(0) = ?
Now we're getting to the good stuff! We can ignore most of the stuff in
the output since we're really interested in the IO part of it. So I've
extracted the relevant IO portion(s) of the output below:
19:54:31.813178 access("testfile", F_OK) = -1 ENOENT (No such file or directory)
19:54:31.813251 access(".", R_OK) = 0
19:54:31.813305 access(".", W_OK) = 0
19:54:31.813374 stat64("testfile", 0xbfa57190) = -1 ENOENT (No such file or directory)
19:54:31.813439 open("testfile", O_RDWR|O_CREAT|O_LARGEFILE, 0666) = 3
19:54:31.813541 brk(0x80b1000) = 0x80b1000
19:54:31.813613 ioctl(3, SNDCTL_TMR_TIMEBASE or TCGETS, 0xbfa58118) = -1 ENOTTY (Inappropriate ioctl for device
)
19:54:31.813688 rt_sigprocmask(SIG_UNBLOCK, [QUIT ALRM], NULL, 8) = 0
19:54:31.813747 rt_sigprocmask(SIG_BLOCK, [QUIT ALRM], NULL, 8) = 0
19:54:31.813900 brk(0x80b2000) = 0x80b2000
19:54:31.814157 brk(0x80b4000) = 0x80b4000
19:54:31.814392 brk(0x80b8000) = 0x80b8000
19:54:31.814782 brk(0x80c0000) = 0x80c0000
19:54:31.815543 brk(0x80cf000) = 0x80cf000
19:54:31.816942 brk(0x80ef000) = 0x80ef000
19:54:31.825402 mmap2(NULL, 258048, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7ea1000
19:54:31.831774 mmap2(NULL, 516096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7e23000
19:54:31.832821 munmap(0xb7ea1000, 258048) = 0
19:54:31.840115 mmap2(NULL, 401408, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7dc1000
19:54:31.841593 write(3, "\200\32\6\0@$tH\200$tH\300$tH\0%tH@%tH\200%tH\300%tH"..., 400008) = 400008
19:54:31.843949 ftruncate64(3, 400008) = 0
19:54:31.844049 _llseek(3, 0, [400008], SEEK_END) = 0
19:54:31.844112 ftruncate64(3, 400008) = 0
19:54:31.844165 _llseek(3, 0, [400008], SEEK_END) = 0
19:54:31.844220 rt_sigprocmask(SIG_UNBLOCK, [QUIT ALRM], NULL, 8) = 0
19:54:31.844285 rt_sigprocmask(SIG_BLOCK, [QUIT ALRM], NULL, 8) = 0
19:54:31.844344 munmap(0xb7dc1000, 401408) = 0
19:54:31.844444 close(3) = 0
There is still some junk in here that we aren't concerned about. I've left them in the above output to help guide through the good bits which I have marked in bold. Let's walk through them a step at a time.
The first five bold lines,
19:54:31.813178 access("testfile", F_OK) = -1 ENOENT (No such file or directory)
19:54:31.813251 access(".", R_OK) = 0
19:54:31.813305 access(".", W_OK) = 0
19:54:31.813374 stat64("testfile", 0xbfa57190) = -1 ENOENT (No such file or directory)
19:54:31.813439 open("testfile", O_RDWR|O_CREAT|O_LARGEFILE, 0666) = 3
simply opens the output file, testfile. The first line is checking if
the file exists (it doesn't - hence the message "No such file or directory").
The second line checks the directory for read access (it can) and the third
line checks for write access (it can). The fourth line returns a status of
the file (primarily to see if it exists - and it doesn't).
The file is finally opened on the fifth line,
open("testfile", O_RDWR|O_CREAT|O_LARGEFILE, 0666) = 3
Then the code does some other things, not pertaining to IO, until the line,
19:54:31.841593 write(3, "\200\32\6\0@$tH\200$tH\300$tH\0%tH@%tH\200%tH\300%tH"..., 400008) = 400008
You can find out what the function write does by looking in the
man pages, but you may have to use the command man 2 write to get
the correct man pages. The function write writes data to a file
descriptor (basically a file). You can see the data that is being written
inside the quotations. For this example, the data is binary so the data
doesn't many any sense (unless you are real geek of course). But if the data is text or non-binary data, then
you might be able to read the data. strace does not list all of
the data being written to the file. The very last number, after the equal
sign, is the number of bytes actually written to the file descriptor.
In this case, it's 400,008 bytes.
The write function call also contains more useful information. The time that the function begins is printed before the function. I used the -tt option which gives the time including microseconds. If we look at the time for the next function (the next line in the output), we can take the difference between the times to get the elapsed time to perform the previous function. For this example, it is the amount of time it took to perform the write function. Usually it's good enough to look at the seconds and microseconds to determine the amount of time for the write or read function. In this case the elapsed time to perform the write function is,
31.843949 - 31.841593 = 0.0023560 seconds (2356.0 microseconds)Now we know that the amount of data is 400,008 bytes and the time is 0.0023560 seconds. So the IO rate is,
400,008 bytes / 0.0023560 seconds = 1.697x10^8 bytes/secThis is the same as 169.78 MB/s assuming that a MB is 1,000,000 bytes. Now you can go through and compute the IO rate for all read and write functions in your strace! (Doesn't this cry out for automation?).
As a side note, the IO rate is pretty high. I was writing to a simple software RAID-1 with two SATA drives and formatted with ext3. The high IO performance is caused by the caching of the file system and possibly even the drive.
MPI Example
What do you do when you strace an MPI code? Ideally you want a file for each MPI process. How do you use strace? in conjunction with mpirun or mpiexec to produce one file per MPI process? One technique I use is to write a main script for the mpirun or mpiexec command. This main script then calls a second script that actually runs the code. In this second script I'll actually put the strace command and the binary. There are a few gotchas that I'll point out in the script. Readers may also be interested two debugging MPI articles by fellow Cluster Monkey Jeff Squyres: MPI: Debugging -- Can You Hear Me Now? and MPI: Debugging in Parallel (in Parallel).
Let's start with a simple example from the
MPI-2
book by Bill Gropp, et. at. In Chapter 2 the authors present a simple
example of an MPI code where each process of N processes writes data
to an individual file (this is usually referred to as N-N IO).
I modified the code to write more data than originally presented.
/* example of parallel Unix write into separate files */
#include "mpi.h"
#include
Being the versatile cluster geek that I am, I re-wrote the code in
Fortran for us older folks.
PROGRAM SEQIO
IMPLICIT NONE
INCLUDE 'mpif.h'
INTEGER :: I
INTEGER :: MYRANK, NUMPROCS, IERROR
INTEGER :: BUFSIZE
REAL :: BUF(100000)
INTEGER :: ISTATUS(MPI_STATUS_SIZE)
CHARACTER*10 FILENAME
CHARACTER*1 RANK1
CHARACTER*2 RANK2
! -------------------------------------------------------
BUFSIZE = 100000
IERROR = 0
CALL MPI_INIT(IERROR)
CALL MPI_COMM_RANK(MPI_COMM_WORLD, MYRANK, IERROR)
CALL MPI_COMM_SIZE(MPI_COMM_WORLD, NUMPROCS, IERROR)
!
DO I=1,BUFSIZE
BUF(I)= 2.5 * BUFSIZE + I
ENDDO
!
IF (MYRANK < 9 ) THEN
WRITE(6,11) MYRANK
WRITE(RANK1,10) MYRANK
FILENAME = "testfile"//RANK1
10 FORMAT(I1)
11 FORMAT("MYRANK:",I1)
ELSEIF ((MYRANK >= 9).AND.(MYRANK < 99)) THEN
WRITE(6,11) MYRANK
WRITE(RANK2,20) MYRANK
FILENAME = "testfile"//RANK2
20 FORMAT(I2)
ENDIF
!
OPEN(7,FILE=FILENAME,FORM="UNFORMATTED")
WRITE(7) BUF(1:BUFSIZE)
CLOSE(7)
!
CALL MPI_FINALIZE(IERROR)
END PROGRAM SEQIO
Let's spend a little bit of time writing the scripts we need to run the
code and get the strace output. Don't worry if you don't know bash
scripting. I'm not an expert by any stretch, and I usually have to ask
friends for help. But the scripts are simple, and I will
show you the couple of bits of specialized knowledge you need.
I start with a main script that I usually call main.sh that contains all of the setup for the code as well as the command to run the MPI code. For this example, I used MPICH2 for the MPI layer and I used g95 for the Fortran90 compiler and gcc for the C compiler. I won't cover all of the details of how to use MPICH2 since the website covers everything much better than I could. Below is the main script I use.
#!/bin/bash mpiexec -machinefile ./MACHINEFILE -np 4 /home/laytonj/TESTING/code1.shThe first line just says to use bash shell to run the script. The last line is the command to run the MPI code. In this case, it's mpiexec. Notice that what mpiexec actually runs is another script code1.sh.
Before I talk about the script code1.sh, I want to mention that it's fairly easy to adapt main.sh to a job scheduler such as SGE, Torque, PBS-Pro, or LSF. I don't have the space to talk about how to write job scripts for these schedulers, but it's fairly straight forward and there is documentation on the web. If you get stuck, you can always ask on the Beowulf mailing list.
Let's take a look at the meat of the scripts, the script code1.sh.
#!/bin/bash /usr/bin/strace -tt -o /tmp/strace.out.$$ /home/laytonj/TESTING/code1 $@Similar to the main script, this script starts by using the bash shell. The third line is the meat of the script. The first part of the line,
/usr/bin/strace -tt -o /tmp/strace.out.$$starts the code in the same way that we started the serial example, by using the command strace. As before I used the options -tt to get microsecond timing, and -o to point to an output file. Here's where we need to think about how to write the script so that each MPI processes writes to a separate output file.
This is the first bit of bash knowledge that we'll use in our scripts. In the script I have specified the strace output file as,
/tmp/strace.out.$$So the output files will be located in the /tmp directory on each node used in the run. To keep the files separate, I have added $$ to the end of the file name. In bash scripts, this is a special variable that contains the PID (Process ID) of the script. This is just a number that is specific to each MPI process. So now we have separate file names for each MPI process.
The last bit of bash knowledge we need is how to pass command line arguments to our code (if we need them). In this case, we use another predefined bash variable, $@. This allows you to use all of the arguments that were passed to the code1.shscript (arg1, arg2, ...) as arguments to the code itself. To better see how this works, let's look at a simple example to make sure you know how to do pass command line arguments to the code in code1.sh.
There is an IO benchmark called IOR that has a number of arguments you can pass to the code that describe the details of how to run the benchmark. Here's an example,
IOR -r -w -a MPIIO -b 25m -N 4 -s 25 -t 10m -v -oDon't won't worry about what all of the options mean, but let me point out a couple because they can be important for a job scheduler script. The option -N 4 tells the code to use 4 MPI processes. You can change the value of 4 to correspond to what the scheduler defines. Now how do we pass these arguments to the script that actually runs the code?
Sticking with the IOR example the main.sh script looks like,
#!/bin/bash mpiexec -machinefile ./MACHINEFILE -np 4 /home/laytonj/TESTING/code1.sh \ -r -w -a MPIIO -b 25m -N 4 -s 25 -t 10m -v -oNotice how I've taken the command line arguments and put them in the main.sh script. With the $@ bash predefined variable in the code script, the options are passed to the code. The code script doesn't change at all (except for the name of the binary).
#!/bin/bash /usr/bin/strace -tt -o /tmp/strace.out.$$ /home/laytonj/TESTING/IOR $@The only thing that changed was the name of the binary from code1 to IOR. So if you want to change the arguments to a code you have to modify the main script. If your code doesn't have any command line arguments, I would recommend just leaving $@ in the code for future reference.
If you look a little closer at the examples scripts for running IOR, it is interesting to note the differences compared to running it without strace. Normally, we would have a single script to run IOR where the command consists of three parts. In order they are,
- mpirun command (command to start MPI code)
- binary of code to run
- arguments to code
When the job is finished you have to go to each node used in the run, and copy the files from /tmp back to whatever file system is more permanent than /tmp. You could write all of the strace output files to a central file system, but you run the risk that you could get two PIDs that are the same. The chances of this are fairly small, but I don't like to take that chance.
Now that we know how to run our MPI jobs using strace, let's look through a simple example. I'm running the code that I presented earlier. I'm going to run with 4 MPI processes for this article. After I run the code I get four strace.out files.
strace.out.3821 strace.out.3822 strace.out.3823 strace.out.3824The PIDs are numbered sequentially because I ran all 4 MPI processes on the same machine. Let's look at one of the strace output files.
If you look at the strace file, you will notice that it is much longer
than for the serial case we ran. The reason is that now we're running
an MPI code so much of the extra function calls are due to MPI doing
it's thing in the background (i.e. behind our code). The first
strace output file is listed in Sidebar One at the end of this
article. I've extracted a few of the important lines from the output
and put them below.
15:12:54.920557 access("testfile1", F_OK) = -1 ENOENT (No such file or directory)
15:12:54.920631 access(".", R_OK) = 0
15:12:54.920687 access(".", W_OK) = 0
15:12:54.920748 stat64("testfile1", 0xbfa56800) = -1 ENOENT (No such file or directory)
15:12:54.920816 open("testfile1", O_RDWR|O_CREAT|O_LARGEFILE, 0666) = 7
...
15:12:54.943471 write(7, "\200\32\6\0@$tH\200$tH\300$tH\0%tH@%tH\200%tH\300%tH"..., 400008) = 400008
15:12:54.945790 ftruncate64(7, 400008) = 0
15:12:54.945888 _llseek(7, 0, [400008], SEEK_END) = 0
15:12:54.945954 ftruncate64(7, 400008) = 0
15:12:54.946010 _llseek(7, 0, [400008], SEEK_END) = 0
If you compare these lines to the ones in the serial code, you can see that they are very similar. Despite having more "junk" in the output, let's look at the IO performance.
The write function call writes the same amount of data, 400,008 bytes. The amount of time to write the data is,
54.945790 - 54.943471 = 0.002319 seconds (2319 micro-seconds).So the IO rate of the write function is,
400,008 bytes / 0.002319 secs. = 1.7249x10^8 bytes/secondThis works out to be 172.49 MB/s. A bit faster than the serial code, but again, I think there are some caching affects.
I won't examine the other 3 strace.out.* files since it's fairly straight forward to compute the write performance for each of them. But we're only compute the IO performance for a single write call. Imagine if you have a number of write and read calls in a single code. Then you have to perform the computations for a number of write and read calls.
Final Comments
I hope this article has given you a bit of an introduction on how to use strace to examine the IO pattern of your code. Notice that I didn't have to modify the code or rebuild it with a new library. This means you run an strace on an existing binary. So it's possible to examine the output pattern of your own codes or ISV codes for which you don't have source code.
This article is only an introduction. The output from a real code is much more extensive and complicated. Plus, you have to examine the strace output carefully to understand the IO patterns.
For example, you need to examine the strace.out files for IO functions in addition to open(), write(), read(), and close(). You want to examine if the code writes a little bit of data (maybe a few kilobytes), then uses lseek to move to a different part of the file, writes some more data, and so on. This IO pattern can be very difficult for file systems to optimize because they don't know how far the code lseeks so they have to assume that the data needs to be flushed to disk. Consequently the file system is writing a bunch of small files which is difficult for any distributed or parallel file system to do well.
Of course, there are IO patterns that some file systems can recognize and adapt to. For example, the code starts at the end of a file, lseeks backward a few megabytes, reads a few kilobytes, lseeks backwards, reads a few kilobytes, and so on. Some distributed or parallel file systems can recognize this IO pattern and perform some optimizations, such as caching the next few megabytes, to improve performance.
The key message of this article is that you need to examine the IO performance of your codes and you can do it without too much difficulty using strace. You may want to examine the IO pattern for several reasons.
- To learn if IO is a performance bottleneck
- To understand your IO pattern
- To more critically examine distributed or parallel file system offerings To just be a better cluster person
I didn't start this article with a rant about how little people know about the IO pattern of the code. But it's true. I've seen people who have no idea how much improving the IO performance will improve the performance of their code. There are also people who claim they know the IO pattern of their code and swear that they have to have a massive high-speed IO system for vastly improved IO performance. They sometimes even quote you huge IO throughputs that they just have to have. But, in actuality they have no idea what improved IO performance will do for their code performance. On the other hand, there are people who have a true understanding of the IO pattern of their code. I like those people.
I hope this article starts your on your way to better understanding the IO pattern of your code so that you don't become one of the knuckleheads I just described.
Processing the strace output files is not easy. In fact, a Perl or Python script that can take the output and produce something that looks readable or even a histogram of the IO performance would be very useful. I've looked on Google and I haven't found any yet, but I think there are a couple out there somewhere. If you find them or if you write one yourself, please contact me through Cluster Monkey or post it to the beowulf mailing list to help everyone else.
Jeff Layton is having way too much fun writing this article, proving that it's hard to keep a good geek down. When he's not creating havoc in his household, he can be found hanging out at the Fry's coffee shop (never during working hours) and admiring the shiny new CPUs that come in, and cringing when someone buys Microsoft Vista. The fund I started to buy some storage for myself is still open and taking donations.
Sidebar One: Example strace from MPI code |
|