Article Index

A special article for IO Knuckleheads

In my experience there are quite a few people who don't really know the IO performance of their application(s) nor the IO pattern. So rather than complain about this situation, as I'm apt to do, I decided to write a column that talks about how to capture the IO behavior of your code(s). These steps can be done for codes where you only have the binary, not necessarily the source. It's fairly easy to capture the information. Processing it is another story, however.

This article will give you an introduction in how to use strace, a commonly used debugging tool for system administrators, as a tool to gather information on the IO pattern of your code(s). I'll show you how to use and gather the information, including how you use it for MPI codes. Read on if you want to learn more about how to use this simple tool to make yourself a better cluster, well..., monkey.

strace - The Friend You Never Knew You Had

strace is one of those tools that system administrators use to track down problems. Here's an article that is an example of how people use strace to debug or troubleshoot problems in *nix operating systems. But as we will see, strace is one of those tools that is multi-purpose.

Before we go too much further, let's look at the man pages for strace. I'm looking at the man pages on my CentOS system but I won't be reproducing the entire man page here. I'll leave that up to you. But I will extract certain sections that I think are relevant. Normally I hate reading man pages, but in this case, I think it's important to get an idea of what strace can do.

NAME
   strace - trace system calls and signals

SYNOPSIS
   strace [ -dffhiqrtttTvxx ] [ -acolumn ] [ -eexpr ] ...  [ -ofile ] [ -ppid ] 
          ...  [ -sstrsize ] [ -uusername ] [ -Evar=val ] ...  [ -Evar ] ...  
          [ command [ arg ...  ] ]

   strace -c [ -eexpr ] ...  [ -Ooverhead ] [ -Ssortby ] [ command [ arg ...  ] ]

DESCRIPTION
   In the simplest case strace runs the specified command until it  exits.   It
   intercepts  and  records  the system calls which are called by a process and
   the signals which are received by  a process.  The name of each system call,   
   its arguments and its return value are  printed  on standard error or to the  
   file specified with the -o option.

   Each line in the trace contains the system call name, followed by its argum-
   ents in  parentheses and its return value.  

OPTIONS

   -t          Prefix each line of the trace with the time of day.
   -tt         If given twice, the time printed will include the microseconds.

   -o filename Write the trace output to the file filename rather than  to
               stderr.   Use filename.pid  if -ff is used.  If the argument
               begins with ?|? or with ?!? then the rest of the argument is 
               treated as a command and  all  output  is piped to it.  This 
               is convenient for piping the debugging output to a program 
               without affecting the redirections of executed programs.

                                    2003-01-21                      STRACE(1)

OK, this is a lot to swallow, but I've listed the options I think are worthwhile to achieve our goals. If you look at the above synopsis and compare it to the entire man page, you'll see there are lots of other options that I haven't included. If you are interested in them, please look at the man pages.

The first option of interest,-tt, is the timing of the system call in microseconds. The reason that I want to use this option is that it will tell us at what time each system call happens, in particular IO functions such as read, write, lseek, etc. We can use this information to compute how much time each function call takes which we can turn into IO rates (MB/s). The second option is the name of the output file. This option, -o allows you to dump all of the strace output to a file. (A screen dump can get rather long!)

When you run a code the first thing you type is the name of the binary itself. If you want to get a directory listing you can type something like, ls -s. When you use strace you run the strace command first, followed but the binary you want to examine. The binary is actually started by strace. So the simple ls -s example looks like, strace ls -s.

I've made a couple of assumptions in only using these two options. The first one is that the application does not spawn any tasks that do IO. Also, there are no threads beside thread 0 that perform any IO. I hope you can live within these assumptions. If you can't, you can still get the strace output, but it's a bit harder to process.

A Simple Serial Example

I'm sure there are a number of people who already know how to use strace to examine the IO of a code, but for those who don't know how, let's walk through a simple example. I'm going to use Fortran 90 to write a simple example. My apologies to the other programmers, but it's the language I've chosen (I'll give a C example later). Who knows, maybe we can turn you into a Fortran coder yet.

The code below is a simple Fortran90 code that just creates an array of real numbers and writes it to a unformatted (binary) file. Pretty simple so far - huh?

PROGRAM SERIAL
   IMPLICIT NONE

   INTEGER :: I
   INTEGER :: BUFSIZE
   REAL :: BUF(100000)
! ----------------------------------------------------------------------
   BUFSIZE = 100000
!
   DO I=1,BUFSIZE
      BUF(I)= 2.5 * BUFSIZE + I
   ENDDO
!
   OPEN(7,FILE="testfile",FORM="UNFORMATTED")
   WRITE(7) BUF(1:BUFSIZE)
   CLOSE(7)
!
END PROGRAM SERIAL

I compiled the program with g95.

g95 -O2 serial.f90 -o serial 
I then ran the binary, serial on my system but I used strace to capture the function calls from the binary to a file called serial.strace.
strace -tt -o serial.strace ./serial 

Now, let's look at the output (warning - it's quite long).

19:54:31.807138 execve("./serial", ["./serial"], [/* 39 vars */]) = 0
19:54:31.807367 uname({sys="Linux", node="home8", ...}) = 0
19:54:31.807567 brk(0)                  = 0x80a0000
19:54:31.807653 access("/etc/ld.so.preload", R_OK) = -1 ENOENT 
(No such file or directory)
19:54:31.807752 open("tls/i686/libm.so.6", O_RDONLY) = -1 ENOENT 
(No such file or directory)
19:54:31.807814 open("tls/libm.so.6", O_RDONLY) = -1 ENOENT 
(No such file or directory)
19:54:31.807873 open("i686/libm.so.6", O_RDONLY) = -1 ENOENT 
(No such file or directory)
19:54:31.807931 open("libm.so.6", O_RDONLY) = -1 ENOENT 
(No such file or directory)
19:54:31.807988 open("/home/laytonj/bin/mpich2-1.0.5p4-g95-old/lib/tls/
i686/libm.so.6", O_RDONLY) = -1 ENOENT (
No such file or directory)
19:54:31.808066 stat64("/home/laytonj/bin/mpich2-1.0.5p4-g95-old/lib/
tls/i686", 0xbfa589bc) = -1 ENOENT (No such file or directory)
19:54:31.808136 open("/home/laytonj/bin/mpich2-1.0.5p4-g95-old/lib/tls/libm.so.6", 
O_RDONLY) = -1 ENOENT (No su
ch file or directory)
19:54:31.808207 stat64("/home/laytonj/bin/mpich2-1.0.5p4-g95-old/lib/tls", 0xbfa589bc) = -1 ENOENT (No such fil
e or directory)
19:54:31.808276 open("/home/laytonj/bin/mpich2-1.0.5p4-g95-old/lib/i686/libm.so.6", O_RDONLY) = -1 ENOENT (No s
uch file or directory)
19:54:31.808347 stat64("/home/laytonj/bin/mpich2-1.0.5p4-g95-old/lib/i686", 0xbfa589bc) = -1 ENOENT (No such fi
le or directory)
19:54:31.808415 open("/home/laytonj/bin/mpich2-1.0.5p4-g95-old/lib/libm.so.6", O_RDONLY) = -1 ENOENT (No such f
ile or directory)
19:54:31.808485 stat64("/home/laytonj/bin/mpich2-1.0.5p4-g95-old/lib", {st_mode=S_IFDIR|0775, st_size=4096, ...
}) = 0
19:54:31.808593 open("/etc/ld.so.cache", O_RDONLY) = 3
19:54:31.808655 fstat64(3, {st_mode=S_IFREG|0644, st_size=87369, ...}) = 0
19:54:31.808734 old_mmap(NULL, 87369, PROT_READ, MAP_PRIVATE, 3, 0) = 0xb7f44000
19:54:31.808808 close(3)                = 0
19:54:31.808874 open("/lib/tls/libm.so.6", O_RDONLY) = 3
19:54:31.808945 read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\0ce\000"..., 512) = 512
19:54:31.809029 fstat64(3, {st_mode=S_IFREG|0755, st_size=138684, ...}) = 0
19:54:31.809110 old_mmap(0x653000, 139424, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x653000
19:54:31.809180 old_mmap(0x674000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x20000)
 = 0x674000
19:54:31.809270 close(3)                = 0
19:54:31.809323 open("tls/i686/libc.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
19:54:31.809382 open("tls/libc.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
19:54:31.809440 open("i686/libc.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
19:54:31.809497 open("libc.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
19:54:31.809553 open("/home/laytonj/bin/mpich2-1.0.5p4-g95-old/lib/libc.so.6", O_RDONLY) = -1 ENOENT (No such f
ile or directory)
19:54:31.809631 open("/lib/tls/libc.so.6", O_RDONLY) = 3
19:54:31.809696 read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\20_\27"..., 512) = 512
19:54:31.809768 fstat64(3, {st_mode=S_IFREG|0755, st_size=1212936, ...}) = 0
19:54:31.809846 old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7f43000
19:54:31.809922 old_mmap(0x161000, 1215644, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x161000
19:54:31.810016 old_mmap(0x284000, 16384, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x12300
0) = 0x284000
19:54:31.810107 old_mmap(0x288000, 7324, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x
288000
19:54:31.810181 close(3)                = 0
19:54:31.810254 old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7f42000
19:54:31.810601 mprotect(0x284000, 4096, PROT_READ) = 0
19:54:31.810693 mprotect(0x15d000, 4096, PROT_READ) = 0
19:54:31.810748 set_thread_area({entry_number:-1 -> 6, base_addr:0xb7f426c0, limit:1048575, seg_32bit:1, conten
ts:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}) = 0
19:54:31.810811 munmap(0xb7f44000, 87369) = 0
19:54:31.810997 brk(0)                  = 0x80a0000
19:54:31.811077 brk(0x80a1000)          = 0x80a1000
19:54:31.811137 brk(0x80a5000)          = 0x80a5000
19:54:31.811205 ioctl(0, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 opost isig icanon echo ...}) = 0
19:54:31.811302 brk(0x80a9000)          = 0x80a9000
19:54:31.811366 ioctl(1, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 opost isig icanon echo ...}) = 0
19:54:31.811432 brk(0x80ad000)          = 0x80ad000
19:54:31.811494 ioctl(2, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 opost isig icanon echo ...}) = 0
19:54:31.811573 rt_sigaction(SIGFPE, {0x8053654, [FPE], SA_SIGINFO}, NULL, 8) = 0
19:54:31.811645 rt_sigaction(SIGHUP, {SIG_DFL}, NULL, 8) = 0
19:54:31.811700 rt_sigaction(SIGINT, {SIG_DFL}, NULL, 8) = 0
19:54:31.811755 rt_sigaction(SIGQUIT, {0x806641c, [], 0}, NULL, 8) = 0
19:54:31.811813 rt_sigaction(SIGALRM, {0x806641c, [], 0}, NULL, 8) = 0
19:54:31.811883 mmap2(NULL, 401408, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7ee0000
19:54:31.813013 rt_sigprocmask(SIG_BLOCK, [QUIT ALRM], NULL, 8) = 0
19:54:31.813178 access("testfile", F_OK) = -1 ENOENT (No such file or directory)
19:54:31.813251 access(".", R_OK)       = 0
19:54:31.813305 access(".", W_OK)       = 0
19:54:31.813374 stat64("testfile", 0xbfa57190) = -1 ENOENT (No such file or directory)
19:54:31.813439 open("testfile", O_RDWR|O_CREAT|O_LARGEFILE, 0666) = 3
19:54:31.813541 brk(0x80b1000)          = 0x80b1000
19:54:31.813613 ioctl(3, SNDCTL_TMR_TIMEBASE or TCGETS, 0xbfa58118) = -1 ENOTTY (Inappropriate ioctl for device
)
19:54:31.813688 rt_sigprocmask(SIG_UNBLOCK, [QUIT ALRM], NULL, 8) = 0
19:54:31.813747 rt_sigprocmask(SIG_BLOCK, [QUIT ALRM], NULL, 8) = 0
19:54:31.813900 brk(0x80b2000)          = 0x80b2000
19:54:31.814157 brk(0x80b4000)          = 0x80b4000
19:54:31.814392 brk(0x80b8000)          = 0x80b8000
19:54:31.814782 brk(0x80c0000)          = 0x80c0000
19:54:31.815543 brk(0x80cf000)          = 0x80cf000
19:54:31.816942 brk(0x80ef000)          = 0x80ef000
19:54:31.825402 mmap2(NULL, 258048, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7ea1000
19:54:31.831774 mmap2(NULL, 516096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7e23000
19:54:31.832821 munmap(0xb7ea1000, 258048) = 0
19:54:31.840115 mmap2(NULL, 401408, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7dc1000
19:54:31.841593 write(3, "\200\32\6\0@$tH\200$tH\300$tH\0%tH@%tH\200%tH\300%tH"..., 400008) = 400008
19:54:31.843949 ftruncate64(3, 400008)  = 0
19:54:31.844049 _llseek(3, 0, [400008], SEEK_END) = 0
19:54:31.844112 ftruncate64(3, 400008)  = 0
19:54:31.844165 _llseek(3, 0, [400008], SEEK_END) = 0
19:54:31.844220 rt_sigprocmask(SIG_UNBLOCK, [QUIT ALRM], NULL, 8) = 0
19:54:31.844285 rt_sigprocmask(SIG_BLOCK, [QUIT ALRM], NULL, 8) = 0
19:54:31.844344 munmap(0xb7dc1000, 401408) = 0
19:54:31.844444 close(3)                = 0
19:54:31.844499 munmap(0xb7e23000, 516096) = 0
19:54:31.844593 brk(0x80ad000)          = 0x80ad000
19:54:31.844670 rt_sigprocmask(SIG_UNBLOCK, [QUIT ALRM], NULL, 8) = 0
19:54:31.844727 munmap(0xb7ee0000, 401408) = 0
19:54:31.844846 exit_group(0)           = ?

Now we're getting to the good stuff! We can ignore most of the stuff in the output since we're really interested in the IO part of it. So I've extracted the relevant IO portion(s) of the output below:

19:54:31.813178 access("testfile", F_OK) = -1 ENOENT (No such file or directory)
19:54:31.813251 access(".", R_OK)       = 0
19:54:31.813305 access(".", W_OK)       = 0
19:54:31.813374 stat64("testfile", 0xbfa57190) = -1 ENOENT (No such file or directory)
19:54:31.813439 open("testfile", O_RDWR|O_CREAT|O_LARGEFILE, 0666) = 3
19:54:31.813541 brk(0x80b1000)          = 0x80b1000
19:54:31.813613 ioctl(3, SNDCTL_TMR_TIMEBASE or TCGETS, 0xbfa58118) = -1 ENOTTY (Inappropriate ioctl for device
)
19:54:31.813688 rt_sigprocmask(SIG_UNBLOCK, [QUIT ALRM], NULL, 8) = 0
19:54:31.813747 rt_sigprocmask(SIG_BLOCK, [QUIT ALRM], NULL, 8) = 0
19:54:31.813900 brk(0x80b2000)          = 0x80b2000
19:54:31.814157 brk(0x80b4000)          = 0x80b4000
19:54:31.814392 brk(0x80b8000)          = 0x80b8000
19:54:31.814782 brk(0x80c0000)          = 0x80c0000
19:54:31.815543 brk(0x80cf000)          = 0x80cf000
19:54:31.816942 brk(0x80ef000)          = 0x80ef000
19:54:31.825402 mmap2(NULL, 258048, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7ea1000
19:54:31.831774 mmap2(NULL, 516096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7e23000
19:54:31.832821 munmap(0xb7ea1000, 258048) = 0
19:54:31.840115 mmap2(NULL, 401408, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7dc1000
19:54:31.841593 write(3, "\200\32\6\0@$tH\200$tH\300$tH\0%tH@%tH\200%tH\300%tH"..., 400008) = 400008
19:54:31.843949 ftruncate64(3, 400008)  = 0
19:54:31.844049 _llseek(3, 0, [400008], SEEK_END) = 0
19:54:31.844112 ftruncate64(3, 400008)  = 0
19:54:31.844165 _llseek(3, 0, [400008], SEEK_END) = 0
19:54:31.844220 rt_sigprocmask(SIG_UNBLOCK, [QUIT ALRM], NULL, 8) = 0
19:54:31.844285 rt_sigprocmask(SIG_BLOCK, [QUIT ALRM], NULL, 8) = 0
19:54:31.844344 munmap(0xb7dc1000, 401408) = 0
19:54:31.844444 close(3)                = 0

There is still some junk in here that we aren't concerned about. I've left them in the above output to help guide through the good bits which I have marked in bold. Let's walk through them a step at a time.

The first five bold lines,

19:54:31.813178 access("testfile", F_OK) = -1 ENOENT (No such file or directory)
19:54:31.813251 access(".", R_OK)       = 0
19:54:31.813305 access(".", W_OK)       = 0
19:54:31.813374 stat64("testfile", 0xbfa57190) = -1 ENOENT (No such file or directory)
19:54:31.813439 open("testfile", O_RDWR|O_CREAT|O_LARGEFILE, 0666) = 3

simply opens the output file, testfile. The first line is checking if the file exists (it doesn't - hence the message "No such file or directory"). The second line checks the directory for read access (it can) and the third line checks for write access (it can). The fourth line returns a status of the file (primarily to see if it exists - and it doesn't). The file is finally opened on the fifth line,

open("testfile", O_RDWR|O_CREAT|O_LARGEFILE, 0666) = 3

Then the code does some other things, not pertaining to IO, until the line,

19:54:31.841593 write(3, "\200\32\6\0@$tH\200$tH\300$tH\0%tH@%tH\200%tH\300%tH"..., 400008) = 400008
You can find out what the function write does by looking in the man pages, but you may have to use the command man 2 write to get the correct man pages. The function write writes data to a file descriptor (basically a file). You can see the data that is being written inside the quotations. For this example, the data is binary so the data doesn't many any sense (unless you are real geek of course). But if the data is text or non-binary data, then you might be able to read the data. strace does not list all of the data being written to the file. The very last number, after the equal sign, is the number of bytes actually written to the file descriptor. In this case, it's 400,008 bytes.

The write function call also contains more useful information. The time that the function begins is printed before the function. I used the -tt option which gives the time including microseconds. If we look at the time for the next function (the next line in the output), we can take the difference between the times to get the elapsed time to perform the previous function. For this example, it is the amount of time it took to perform the write function. Usually it's good enough to look at the seconds and microseconds to determine the amount of time for the write or read function. In this case the elapsed time to perform the write function is,

31.843949 - 31.841593 = 0.0023560 seconds (2356.0 microseconds)
Now we know that the amount of data is 400,008 bytes and the time is 0.0023560 seconds. So the IO rate is,
400,008 bytes / 0.0023560 seconds = 1.697x10^8 bytes/sec
This is the same as 169.78 MB/s assuming that a MB is 1,000,000 bytes. Now you can go through and compute the IO rate for all read and write functions in your strace! (Doesn't this cry out for automation?).

As a side note, the IO rate is pretty high. I was writing to a simple software RAID-1 with two SATA drives and formatted with ext3. The high IO performance is caused by the caching of the file system and possibly even the drive.

You have no rights to post comments

Search

Login And Newsletter

Create an account to access exclusive content, comment on articles, and receive our newsletters.

Feedburner


This work is licensed under CC BY-NC-SA 4.0

©2005-2023 Copyright Seagrove LLC, Some rights reserved. Except where otherwise noted, this site is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International. The Cluster Monkey Logo and Monkey Character are Trademarks of Seagrove LLC.