[Beowulf] ...Re: Benchmark results

Robert G. Brown rgb at phy.duke.edu
Wed Jan 7 11:18:33 EST 2004


On Tue, 6 Jan 2004, Joe Landman wrote:

> Physics 101:
> 
> Take the precision of your clock.  Your uncertainty in using the clock
> as a measuring device is +/- 1/2 of the clock's precision (call this
> delta_t), or 1/2 of the smallest interval that this clock can measure.
> As you are really interested in time intervals, you may need to appeal
> to basic error analysis.  That said, if you look at a really rough
> estimate of the contribution of the random error of the clock for your
> runs, have a look at the ratio of delta_t to the total run time.  For a
> run of 30 minutes, t ~ 1800 seconds, while using 0.01 seconds as the
> clock's precision gives you about 3 x 10**(-6) (or 1 part in 360000).
> That is, the contribution to the error due to the clock is small here.  
> 
> Even for a run of 60 seconds it is still in the 0.01% regime.  To get a
> 5% error out of that, you would need delta_t/t ~ 0.05, or t about 0.1
> second for your total benchmark run time.

A Brief Digression on Benchmark Timing

(Yes, I lie like a rug.  It is not brief.  If you already know about
timing, or don't care, hit 'd' now...:-)

To add a bit of pure CPS 1 empiricism to Joe's remarks, I've played
quite a bit with timers while writing cpu_rate and using lmbench and so
forth.  There are two reasonably accurate "simple" timing mechanisms
available in linux, one simpler (but less accurate) than the other.

gettimeofday returns a timeval struct with seconds and microseconds
since the Epoch.  Typical use in a timing loop saves the timeval at an
interval start, runs the loop, saves the timeval at the stop, and
computes the delta.  This generates a wall-clock delta and EMPIRICALLY is
accurate to about two microseconds, no more.  This makes it nearly
useless for timing short loops -- say a few hundred iterations of a
multiply -- because multiplies take order of a nanosecond (depending on
locality and clock) and one has to execute thousands and thousands of
them to use up more time than the resolution of your clock.  A timing of
2 usec +/- 2 usec isn't very useful.  If you execute a few billion of
them (so the internal loop takes order of seconds to complete) things
are better -- 2 sec +/- 2 usec isn't too bad, precision wise.
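
To make the gettimeofday recipe concrete, here is a toy standalone
sketch (not from cpu_rate; the loop body and iteration count are made
up) of the start/stop/delta pattern described above:

#include <stdio.h>
#include <sys/time.h>

int main(void)
{
 struct timeval tv_start,tv_stop;
 volatile double x = 1.0;   /* volatile so the loop isn't optimized away */
 long i;
 double delta;

 gettimeofday(&tv_start,(struct timezone *) NULL);
 /* the timed work -- make it big enough (order of seconds) that the
  * ~2 usec resolution of gettimeofday is negligible */
 for(i=0;i<1000000000L;i++) x *= 1.000000001;
 gettimeofday(&tv_stop,(struct timezone *) NULL);

 delta = (double)(tv_stop.tv_sec - tv_start.tv_sec)
       + 1.0e-6*(double)(tv_stop.tv_usec - tv_start.tv_usec);
 printf("elapsed = %f seconds (x = %f)\n",delta,x);
 return 0;
}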

However now there is another tradeoff.  A timed interval that occupies
only a hundred microseconds or so has an excellent chance of not being
interrupted (depending, of course, on system environment -- I'm assuming
a basically quiescent network and no other toplevel userspace tasks
running).  One that occupies two seconds will be interrupted hundreds of
times, because if nothing else the system timer will issue interrupts
that the kernel has to handle.  If you run "vmstat 1" on your box,
you'll note that there are roughly 100 interrupts per second steady
state on a quiet system.  Associated with these are 100+ context
switches.  These interrupts and context switches are quite "expensive",
and one is tempted to get off into a discussion of the kernel design and
upper/lower half handlers and ways the system tries to bundle up its
work to minimize impact on running processes, but it would be a moderate
digression.  

Suffice it to say that if you time LOTS of 100 microsecond loops
separately in a continuous process and look at the results, you're
likely to see the bimodal distribution Joe refers to (or even a fully
multimodal distribution) because SOME of those intervals will be
interrupted and the interrupt/context switch will be "timed" along with
the swapped out loop by the wall-clock timer.  This can be somewhat
ameliorated by adding a usleep right before each timing loop iteration,
as the kernel will often take advantage of a task pause to complete
pending interrupts.  However, it alas isn't a magic bullet because not
all interrupts are "pendable".
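
A hypothetical standalone sketch of this kind of measurement (again not
cpu_rate itself; the counts and the usleep interval are made up) -- time
many short loops separately, with a usleep before each pass as suggested
above, and pipe the printed deltas into a histogram to see the modes:

#include <stdio.h>
#include <unistd.h>
#include <sys/time.h>

int main(void)
{
 struct timeval t0,t1;
 volatile double x = 1.0;
 int rep,i;
 double usec;

 for(rep=0;rep<1000;rep++){
   usleep(1000);     /* let the kernel drain pending work before timing */
   gettimeofday(&t0,(struct timezone *) NULL);
   for(i=0;i<100000;i++) x *= 1.000001;  /* very roughly 100 usec of work */
   gettimeofday(&t1,(struct timezone *) NULL);
   usec = 1.0e6*(double)(t1.tv_sec - t0.tv_sec)
        + (double)(t1.tv_usec - t0.tv_usec);
   printf("%f\n",usec);
 }
 return 0;
}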

gettimeofday is actually a mediocre timer for nearly all
microbenchmarking purposes: it is so much slower than the operations
that it times that getting good precision between these competing limits
is not at all easy.  It is just fine for macroscopic, large scale
completion times that average over all sorts of system state changes
(e.g. all those int/css).  A much better timer is available on nearly
all intel/amd cpus.  You can actually read the CPU's onboard cycle
counter with a suitable bit of assembler magic.  By also reading the
CPU's clock (from /proc/cpuinfo) and doing a bit of division, you can
obtain a "time" accurate (empirically, again) to roughly 10-60
nanoseconds.  Very roughly, as this depends strongly on the clock and
other systems architecture differences that can vary a bit from system
to system and whether or not the clock is wrapped up quasi-portably in a
subroutine call (adding a bit of overhead) or inlined as raw assembler
on both sides of the timing loop -- this is just what I measure.

This nanotimer is FAR more suitable for use in microbenchmarking --
relatively small loops where one devoutly hopes to avoid getting nailed
by a context switch as one is interested in "how many floating point
multiplies can the system perform on numbers in its registers/L1
cache/L2 cache/memory per second".  A loop with a few microseconds worth
of operations, fairly likely not to be interrupted if you start
immediately after a timer click, can yield < 1% accuracy.  Multimodality
in a distribution of such short loop timings yields "interesting"
information about the system's CPU, cache, memory architecture, and
their interaction with the kernel.
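
A minimal sketch of the cycle counter idea (hypothetical and standalone;
the attached cpu_rate source below does the /proc/cpuinfo parsing and
time-base bookkeeping properly, here the nanoseconds per cycle is simply
assumed):

#include <stdio.h>

/* ASSUMPTION: a 2400 MHz clock; in real use read the "cpu MHz" line
 * of /proc/cpuinfo as the attached code does. */
#define NSEC_PER_CYCLE (1000.0/2400.0)

static unsigned long long rdtsc(void)
{
 unsigned int lo,hi;
 asm volatile("rdtsc" : "=a" (lo), "=d" (hi));  /* intel/amd only */
 return ((unsigned long long) hi << 32) | lo;
}

int main(void)
{
 volatile double x = 1.000001;
 unsigned long long c0,c1;
 int i;

 c0 = rdtsc();
 for(i=0;i<10000;i++) x *= 1.000001;
 c1 = rdtsc();
 printf("10000 multiplies: %f nsec total (x = %f)\n",
        (double)(c1 - c0)*NSEC_PER_CYCLE,x);
 return 0;
}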

I've wrapped both of the timers up in a single file in a somewhat
portable form.  Because there can be a LOT of clock cycles in a second
(several billion, obviously) the nanotimer is a bit complex --
on the first call it saves the base clock registers as a baseline that
it subtracts from subsequent calls BEFORE converting the results to a
double precision nanosecond count.  Done in the wrong order, you can end
up with a significant precision loss subtracting two large numbers to
get a very small number, hence the runaround.  I'm pretty sure it is
done right at this point.  I attach the timer source below for anyone
who cares.
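
(A tiny made-up illustration of the precision issue: near 1.0e15 --
roughly eleven days' worth of nanoseconds -- a double can only resolve
steps of 0.125, so converting a big raw count to nanoseconds BEFORE
subtracting quantizes short intervals; subtracting the raw counts first
and converting the small difference avoids this entirely:

#include <stdio.h>

int main(void)
{
 /* made-up values: two absolute times in nanoseconds, roughly eleven
  * days from zero and 7.3 nsec apart */
 double t1 = 1.0e15;
 double t2 = 1.0e15 + 7.3;
 printf("%f\n",t2 - t1);   /* prints 7.250000, not 7.300000 */
 return 0;
}
)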

   rgb

> You wouldn't even need another process taking up time, the kernel will
> happily do this for you with some of its housekeeping chores (bdflush,
> kswapd, ...).  Servicing lots of interrupts is a good way to slow things
> down, as are doing many small page allocations where one large one would
> suffice, etc.

Absolutely.  In fact, one of the "fun" things about microbenchmarking is
that a good microbenchmarking suite and/or a set of good vmstat-like
tools can really help you understand and "see" the actual costs of stuff
like this.  Josip Loncaric (on this list) worked very hard some years
ago to run down a moderately heinous intermittent latency hit in the
(IIRC) TCP networking stack, for example.  Every now and then instead of
a predictable relatively short packet latency a really long (again IIRC,
2 msec or the like compared to timings on the order of tens of usec) hit
would appear.

Sending lots of small packets is "bad".  Doing lots of malloc is "bad".
Doing any sort of interrupt is bad, in part because the system has a
tendency to swap in waiting/competing tasks (if any) when the interrupt
returns and of course because of the relatively high cost of the
interrupt itself.  Lots of system calls can have the same effect --
select, for example, has about the same cost as gettimeofday, and AFAICT
nanosleep is really usleep -- the ONLY way I've found to "pause" for a
nanoscale-specified time is to use a nanotimer-based polling loop.
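
For example, a sketch of such a polling "pause" (assuming the attached
nanotimer source is compiled in and nanotimer_rdtsc_init() has already
been called to set the time base):

extern double nanotimer_rdtsc();  /* from the attached nanotimer.c */

/* Busy-wait until at least nsec nanoseconds have elapsed.  Burns CPU,
 * but it is the only way I know of to "pause" on a sub-microsecond
 * scale from userspace. */
void nanospin(double nsec)
{
 double start = nanotimer_rdtsc();
 while(nanotimer_rdtsc() - start < nsec)
   ;  /* spin */
}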

FORTUNATELY, most of this isn't horribly difficult to understand or to
avoid in actual code, and FORTUNATELY most people really are interested
in macroscopic wall-clock completion times of large scale code, where
"date" at the shell is accurate enough to get decent timings.  Where it
makes a difference is in optimizing core loops or understanding how you
spent all this money on a 3 GHz CPU system that IN PRINCIPLE has a 6
GFLOPS peak but in practice you are lucky to get 1 GFLOPS on a good day
with a tail wind.

[Parenthetical Aside:  Youngsters don't know HOW lucky you are to get
that 1 GFLOPS in a $1000 system, of course.  Suffice it to say that as
of a few years ago, your desktop would have been classified as a weapon
for purposes of export, and sending one outside of the US would have
taken a special license, and a decade ago a GFLOPS system (even a
cluster) would have cost order of $1 million.  Gotta love Moore's Law
(and the fact that Our Government, helped enormously by the beowulf
movement, finally wised up to Moore's Law and moved any remaining
weapons boundaries out so far that they are invisible).]

   rgb

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu

nanotimer:  COPYING is GPL v2b, so play on through.  You can get the
whole package in context in cpu_rate from my personal website and I
think on brahma.

%< snip snip snip =======================================================
/*
 *========================================================================
 * $Id: nanotimer.h,v 1.1 2003/06/10 15:41:27 rgb Exp $
 *
 * See copyright in copyright.h and the accompanying file COPYING
 *========================================================================
 */

#include "copyright.h"
#define K 1024

 /*
  * The following are the struct components of the timer(s).  They
  * should let us write one bit of code and use it with either
  * timer transparently.
  */
 typedef struct {
   double (*timer)();
   double start;
   double stop;
   double delta;
   double nsec_granularity;
   char name[K];
 } Nanotimer;

 /*
  * Built in cycle counter timer (Intel or AMD) resolution
  * roughly 1 CPU clock (sub-nanosecond on modern CPUs).
  */
 static double nsec_per_cycle;
 static unsigned long ax_first,dx_first;
 static unsigned long long count_first;

 void nanotimer_rdtsc_init(Nanotimer *new_nanotimer);
 double nanotimer_rdtsc();

 /*
  * Fallback timer using gettimeofday.  Resolution of roughly
  * 2 usec on a good day, but the way we scale test intervals
  * it doesn't usually matter.
  */
 static struct timeval tv_first;

 void nanotimer_gttod_init(Nanotimer *nanotimer);
 double nanotimer_gttod();

%< snip snip snip =======================================================
/*
 * $Id: nanotimer.c,v 1.9 2003/02/28 21:18:04 rgb Exp $
 *
 * See copyright in copyright.h and the accompanying file COPYING
 *
 */

/*
 *========================================================================
 * timing and utility sources.  tv_start and tv_stop are globals.
 *========================================================================
 */

#include "cpu_rate.h"

/*
 * The following two routines constitute a NONportable, but very accurate,
 * intel/amd timer for pentia and athlons (tsc flag in /proc/cpuinfo).
 * The init call basically extracts the CPU clock from /proc/cpuinfo
 * and converts it into nanoseconds/cpu clock cycle.  The second reads
 * the cpu cycle counter directly and converts it into nanoseconds.
 *
 * To avoid potential problems with loss of precision on the interval
 * subtraction, this program stores the register contents on the
 * first call and only counts the nanoseconds from the first call, not
 * since the last boot.
 *
 * It is quite fast (order of 40-50 nsec).
 */

void nanotimer_rdtsc_init(Nanotimer *new_nanotimer)
{

 int i;
 char statbuf[K];
 char delim[2],*nextval;
 FILE *cpuinfo_fd;
 double loop_delta_1000,loop_delta_1000000,loop_ratio;

 snprintf(new_nanotimer->name,K,"cpu cycle counter nanotimer");
 if(verbose == V_TIMER){
   printf("#========================================================================\n");
   printf("# Initializing timer %s\n",new_nanotimer->name);
 }
 new_nanotimer->timer = nanotimer_rdtsc;

 /*
  * First we extract the number of nanoseconds per cycle (required!)
  * from /proc/cpuinfo.  AFAICT, this is the easiest/only way to get
  * it, although I don't know about its precision.  This is fairly
  * standard parsing code (ripped from xmlsysd).
  */
 delim[0] = ':';                /* separator */
 delim[1] = '\0';               /* string terminator */
 cpuinfo_fd = fopen("/proc/cpuinfo","r");
 if(cpuinfo_fd == NULL){
   fprintf(stderr,"Error: Cannot open /proc/cpuinfo.\n");
   fprintf(stderr,"Try the -g flag to use gettimeofday instead.\n");
   exit(1);
 }
 while(1){

   /* Normal EOF causes break from while loop */
   if(fgets(statbuf,K,cpuinfo_fd) == NULL) break;

   if(strncmp(statbuf,"cpu MHz",7) == 0) {
     nextval = strtok(statbuf,delim);       /* first field skip */
     nextval = strtok((char *)NULL,delim);  /* second field is it */
     nsec_per_cycle = 1000.0/atof(nextval);
     break;
   }
 }

 fclose(cpuinfo_fd);

 if(nsec_per_cycle == 0.0){
   fprintf(stderr,"Error: Cannot parse out the cpu MHz from /proc/cpuinfo.\n");
   fprintf(stderr,"Cannot use built-in CPU cycle counter as timer.\n");
   fprintf(stderr,"Try the -g flag to use gettimeofday instead.\n");
   exit(1);
 }

 /*
  * Now we set the key elements of the nanotimer struct and TEST its
  * call granularity.
  *
  * We subtract off the time base to ensure that times of
  * order seconds (and then some) have resolvable differences
  * in double precision.  This is therefore the initial call to
  * the cpu cycle timer to extract the initial time base and
  * MUST precede the use of the timer.  We do NOT CHECK to keep
  * the timer fast!
  */
 asm volatile("rdtsc" : "=a" (ax_first), "=d" (dx_first));
 count_first = dx_first;
 count_first = count_first<<32;
 count_first += ax_first;

 /*
  * To determine the granularity, we just start calling it in a loop until
  * we've accumulated a few zillion calls, and see how long they took on
  * average.  We also crudely time the loop itself, and make sure that
  * its timing SCALES WITH SIZE.
  */
 new_nanotimer->stop = 0.0;
 new_nanotimer->delta = 0.0;
 /*
  * Empty loop.  The printf(""); seems more reliable than anything,
  * including a nanosleep etc., at ensuring that the test starts on a
  * clock cycle break, yielding the most consistent results.
  */
 printf("");
 new_nanotimer->start = nanotimer_rdtsc();
 for(i=0;i<1000;i++){
 }
 new_nanotimer->stop = nanotimer_rdtsc();
 loop_delta_1000 = new_nanotimer->stop - new_nanotimer->start;
 if(verbose == V_TIMER) printf("# loop_delta_1000 = %f\n",loop_delta_1000);

 printf("");
 new_nanotimer->start = nanotimer_rdtsc();
 for(i=0;i<1000000;i++){
 }
 new_nanotimer->stop = nanotimer_rdtsc();
 loop_delta_1000000 = new_nanotimer->stop - new_nanotimer->start;
 if(verbose == V_TIMER) printf("# loop_delta_1000000 = %f\n",loop_delta_1000000);

 loop_ratio = loop_delta_1000000/loop_delta_1000;
 if(verbose == V_TIMER) printf("# loop_ratio = %f (should be near 1000)\n",loop_ratio);

 /*
  * It should get bigger.  I don't much care how MUCH bigger
  */
 if(loop_ratio < 10){
   fprintf(stderr,"Warning:  Empty loop measurement not scaling in nanotimer_rdtsc_init()\n");
   fprintf(stderr,"  The compiler may optimize away empty loops!\n");
 }
 
 new_nanotimer->stop = 0.0;
 new_nanotimer->delta = 0.0;
 /*
  * Full loop.  We start right after a clock tick, we hope, and
  * we subtract out the loop_delta from above.
  */
 printf("");
 new_nanotimer->start = nanotimer_rdtsc();
 for(i=0;i<1000000;i++){
   new_nanotimer->stop = nanotimer_rdtsc();
 }
 new_nanotimer->delta = new_nanotimer->stop - new_nanotimer->start - loop_delta_1000000;
 if(verbose == V_TIMER) printf("# nanotimer delta = %f for 1000000 iterations\n",new_nanotimer->delta);
 
 new_nanotimer->nsec_granularity = new_nanotimer->delta/1000000.0;

 if(verbose == V_TIMER){
   printf("# %s: granularity = %f nanoseconds\n",
      new_nanotimer->name,new_nanotimer->nsec_granularity);
 }

}

double nanotimer_rdtsc()
{
 unsigned long ax, dx;
 unsigned long long count;
 double nanotime;

 asm volatile("rdtsc" : "=a" (ax), "=d" (dx));
 count = dx;
 count = count<<32;
 count += ax;
 count -= count_first;
 nanotime = (double) count;
 nanotime *= nsec_per_cycle;
 /* printf("nanotime = %f\n",nanotime); */
 return(nanotime);
 
}

/*
 * This is a portable nanosecond timer.  It uses gettimeofday (wall clock
 * time) with the time of the first call subtracted off to keep intervals
 * from horribly overflowing the double with irrelevant numbers (causing
 * a loss of precision).  Note that my direct measurements show that
 * gettimeofday() itself takes about 2 usec to complete.
 */

void nanotimer_gttod_init(Nanotimer *new_nanotimer)
{

 int i;
 double loop_delta_1000,loop_delta_1000000,loop_ratio;

 snprintf(new_nanotimer->name,K,"gettimeofday nanotimer");
 if(verbose == V_TIMER){
   printf("#========================================================================\n");
   printf("# Initializing timer %s\n",new_nanotimer->name);
 }
 new_nanotimer->timer = nanotimer_gttod;

 /*
  * Now we set the key elements of the nanotimer struct and TEST its
  * call granularity.
  *
  * To determine the granularity, we just start calling it in a loop until
  * we've accumulated a few zillion calls, and see how long they took on
  * average.  We also crudely time the loop itself, and make sure that
  * its timing SCALES WITH SIZE.
  */
 new_nanotimer->stop = 0.0;
 new_nanotimer->delta = 0.0;
 /*
  * Empty loop.  The printf(""); seems more reliable than anything,
  * including a nanosleep etc., at ensuring that the test starts on a
  * clock cycle break, yielding the most consistent results.
  */
 printf("");
 new_nanotimer->start = nanotimer_gttod();
 for(i=0;i<1000;i++){
 }
 new_nanotimer->stop = nanotimer_gttod();
 loop_delta_1000 = new_nanotimer->stop - new_nanotimer->start;
 if(verbose == V_TIMER) printf("# loop_delta_1000 = %f\n",loop_delta_1000);

 printf("");
 new_nanotimer->start = nanotimer_gttod();
 for(i=0;i<1000000;i++){
 }
 new_nanotimer->stop = nanotimer_gttod();
 loop_delta_1000000 = new_nanotimer->stop - new_nanotimer->start;
 if(verbose == V_TIMER) printf("# loop_delta_1000000 = %f\n",loop_delta_1000000);

 loop_ratio = loop_delta_1000000/loop_delta_1000;
 if(verbose == V_TIMER) printf("# loop_ratio = %f (should be near 1000)\n",loop_ratio);

 /*
  * It should get bigger.  I don't much care how MUCH bigger
  */
 if(loop_ratio < 10){
   fprintf(stderr,"Warning:  Empty loop measurement not scaling in nanotimer_gttod_init()\n");
   fprintf(stderr,"  The compiler may optimize away empty loops!\n");
 }
 
 new_nanotimer->stop = 0.0;
 new_nanotimer->delta = 0.0;
 /*
  * Full loop.  We start right after a clock tick, we hope, and
  * we subtract out the loop_delta from above.
  */
 printf("");
 new_nanotimer->start = nanotimer_gttod();
 for(i=0;i<1000000;i++){
   new_nanotimer->stop = nanotimer_gttod();
 }
 new_nanotimer->delta = new_nanotimer->stop - new_nanotimer->start - loop_delta_1000000;
 if(verbose == V_TIMER) printf("# nanotimer delta = %f for 1000000 iterations\n",new_nanotimer->delta);
 
 new_nanotimer->nsec_granularity = new_nanotimer->delta/1000000.0;

 if(verbose == V_TIMER){
   printf("# %s: granularity = %f nanoseconds\n",
      new_nanotimer->name,new_nanotimer->nsec_granularity);
 }

}

double nanotimer_gttod()
{

 struct timeval tv_now;
 double nanotime;

 /*
  * This avoids potential precision problems by computing the starting
  * time as of the first call, and measuring all subsequent times
  * relative to this.  Gets rid of a LOT of seconds.
  */
 if((tv_first.tv_sec == 0) && (tv_first.tv_usec == 0)){
    gettimeofday(&tv_first, (struct timezone *) NULL);
 }
 
 gettimeofday(&tv_now, (struct timezone *) NULL);
 nanotime = (double)(tv_now.tv_sec - tv_first.tv_sec);
 nanotime += 1.0e-6*(double)(tv_now.tv_usec - tv_first.tv_usec);

 /* return nanoseconds */	  
 nanotime *= 1.e+9;
 return(nanotime);
 
}

