[Beowulf] ...Re: Benchmark results

Josip Loncaric josip at lanl.gov
Wed Jan 7 13:41:58 EST 2004


Robert G. Brown wrote:

> A Brief Digression on Benchmark Timing
> 
> (Yes, I lie like a rug.  It is not brief.  If you already know about
> timing, or don't care, hit 'd' now...:-)

Thank you for providing this excellent article.  I'd just like to add a 
few details on fine-grained timings, which are indeed useful in tracking 
down some performance quirks...

> gettimeofday is actually a mediocre timer for nearly all
> microbenchmarking purposes

While gettimeofday() is quite portable as a function call, its timing 
accuracy is not.  On some machines (e.g. Linux PCs) the times are 
reliable to within a few microseconds.  On other systems, the true 
granularity can be 1000 microseconds, apparently related to the 
scheduler frequency.  This seriously undermines the portability of 
fine-grained timings done with gettimeofday().  (A quick test of the 
effective granularity is sketched below.)
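
A minimal sketch of such a granularity test (the code and names are 
mine, not from any standard benchmark): spin on gettimeofday() and 
record the smallest nonzero step between consecutive readings.

   #include <stdio.h>
   #include <sys/time.h>

   int main(void)
   {
       struct timeval t0, t1;
       long step, min_step = 1000000;   /* microseconds */
       int i;

       for (i = 0; i < 100000; i++) {
           gettimeofday(&t0, NULL);
           do {                         /* spin until the clock moves */
               gettimeofday(&t1, NULL);
           } while (t1.tv_sec == t0.tv_sec && t1.tv_usec == t0.tv_usec);
           step = (t1.tv_sec - t0.tv_sec) * 1000000L
                + (t1.tv_usec - t0.tv_usec);
           if (step < min_step)
               min_step = step;
       }
       printf("smallest observed gettimeofday() step: %ld us\n", min_step);
       return 0;
   }

Compiled with gcc and run on the machine at hand, this reports the 
effective timer step that fine-grained measurements will actually see.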

>  A much better timer is available on nearly
> all intel/amd cpus.

Indeed -- and thank you for your code examples.  One minor simplifying 
comment (I learned this recently) is that GNU C has the ability to read 
the full 64-bit time stamp counter as follows:

   unsigned long long tsc;
   asm volatile ("rdtsc": "=A" (tsc));

where the capitalized "=A" tells gcc that the operand constraints of 
RDTSC involve both a and d registers, as in:

"A -- Specifies the a or d registers. This is primarily useful for 
64-bit integer values (when in 32-bit mode) intended to be returned with 
the d register holding the most significant bits and the a register 
holding the least significant bits."
--http://gcc.gnu.org/onlinedocs/gcc-3.3.2/gcc/Machine-Constraints.html
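
For completeness, here is a minimal sketch of wrapping this into a 
helper (the function name is mine).  Note that "A" splits a 64-bit 
value across the edx:eax pair only in 32-bit mode; on x86-64 a 
two-output ("=a"/"=d") form would be needed instead.

   static inline unsigned long long read_tsc(void)
   {
       unsigned long long tsc;
       asm volatile ("rdtsc" : "=A" (tsc));
       return tsc;
   }

   /* usage: cycles elapsed around a code section */
   unsigned long long start = read_tsc();
   /* ... code being timed ... */
   unsigned long long cycles = read_tsc() - start;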

Regarding the RDTSC instruction, those interested in more detail can consult

  http://cedar.intel.com/software/idap/media/pdf/rdtscpm1.pdf

but I'd like to add a few observations:

(1) Pentiums execute instructions out of order.  Although the RDTSC 
instruction reads the 64-bit Time Stamp Counter, which advances at 
the CPU's clock frequency (~3GHz these days), the meaning of the count 
might not be exactly what you expected.  For maximum reliability, one 
can issue a serializing instruction such as CPUID before RDTSC (a 
sketch follows below), at a cost of about 300 clock cycles (CPUID can 
take a variable amount of time to execute).  BTW, just doing the RDTSC 
and moving its result to other registers takes about 86 cycles on a 
Pentium III.  Intel's document suggests some compensation techniques, 
but in many cases 300-cycle resolution (about 100 nanoseconds) is 
already OK, so CPUID is optional.  Also, older Pentiums (plain or MMX) 
do not need CPUID.
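
A minimal sketch of such a serialized read (the helper name is mine; 
again this assumes GNU C on x86, using separate "=a"/"=d" outputs):

   static inline unsigned long long read_tsc_serialized(void)
   {
       unsigned int lo, hi;
       /* CPUID is serializing: all preceding instructions must
        * complete before RDTSC samples the counter. */
       asm volatile ("cpuid\n\t"
                     "rdtsc"
                     : "=a" (lo), "=d" (hi)
                     : "a" (0)               /* CPUID leaf 0 */
                     : "ebx", "ecx");        /* clobbered by CPUID */
       return ((unsigned long long)hi << 32) | lo;
   }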

(2) CPU time stamp counters exist on other processors, e.g. PowerPC and 
Alpha.  While 64-bit counters won't overflow for centuries, machines 
with 32-bit counters (Alpha) experience counter rollovers every few 
seconds.  This must be accounted for when interpreting the results (see 
the sketch below).
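
A minimal sketch of the usual way to handle a single rollover (the 
function name is mine): do the subtraction in unsigned 32-bit 
arithmetic, so the wraparound cancels as long as the measured interval 
spans at most one rollover.

   #include <stdint.h>

   /* elapsed count between two 32-bit counter reads (e.g. Alpha RPCC);
    * modulo-2^32 arithmetic handles a single rollover automatically */
   uint32_t elapsed_cycles(uint32_t start, uint32_t end)
   {
       return end - start;
   }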

(3) Thermal management on recent Pentiums duty-cycles the clock whenever 
the processor is in danger of overheating.  The clock frequency does not 
slow down, but the CPU core sees a normal/stopped clock with a thermally 
controlled duty cycle.  This raises the question: is the time stamp 
counter affected?  I hope not, but I do not know for sure.  There are 
arguments for both approaches: counting only the cycles seen by the 
core, or counting all clock cycles; I do not know which way Intel's 
engineers went.

(4) SMP systems present a special challenge in using CPU time stamp 
counters for precision timings, because the timing process can migrate 
from one CPU to another, where the counter might have a different 
offset.  Fortunately, Linux checks this on startup and synchronizes the 
CPU time stamp counters when booted.  Provided that all CPUs count the 
same clock cycles, they should remain synchronized, so process migration 
need not affect timings.  However, this can be a serious problem if the 
operating system does not ensure that CPU counters are closely 
synchronized, or if the individual CPU counters drift apart for some 
reason (e.g. counting a thermally duty-cycled clock, as in (3) above).

(5) Precision timings are sometimes available from other sources, e.g. 
some high end network cards (Quadrics elan_clock() returns nanoseconds). 
Also, MPI specifies the portable MPI_Wtime() and MPI_Wtick() calls (a 
small example follows below), but again, their actual resolution varies 
by MPI implementation and platform (e.g. some MPI implementations use 
gettimeofday() to return MPI_Wtime() in seconds).
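
A minimal sketch of checking what a given MPI installation delivers 
(only standard MPI calls are used; the numbers it prints are of course 
implementation- and platform-dependent):

   #include <stdio.h>
   #include <mpi.h>

   int main(int argc, char **argv)
   {
       int rank;
       double t0, t1;

       MPI_Init(&argc, &argv);
       MPI_Comm_rank(MPI_COMM_WORLD, &rank);

       t0 = MPI_Wtime();
       MPI_Barrier(MPI_COMM_WORLD);      /* something cheap to time */
       t1 = MPI_Wtime();

       if (rank == 0) {
           printf("MPI_Wtick() = %g s (claimed timer resolution)\n",
                  MPI_Wtick());
           printf("barrier took %g s on rank 0\n", t1 - t0);
       }
       MPI_Finalize();
       return 0;
   }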

(6) All of the above applies to measurements of *local* time only.  The 
notion of *global* time across a cluster is much fuzzier than it ought 
to be, even though it should be technically possible to keep node clocks 
synchronized to microseconds across a cluster.  NTP helps, but even 
better synchronization is needed.  One approach to collecting 
fine-grained global timings is to work with local timers, then correct 
their offsets in a post-processing step (a trivial sketch follows below; 
this can be tricky, since estimating individual offsets assumes that 
there were enough common timed events during the run).
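
A trivial sketch of such a post-processing step (the names are mine, 
and this is only the simplest possible version): estimate each node's 
offset from node 0 using the local timestamps of an event assumed to be 
simultaneous on all nodes (e.g. exit from a barrier), then shift local 
times onto node 0's time base.  A single common event ignores clock 
drift; more events would allow fitting both offset and drift.

   /* offset of this node's clock relative to node 0, from one event */
   double estimate_offset(double t_event_local, double t_event_node0)
   {
       return t_event_local - t_event_node0;
   }

   /* map a local timestamp onto node 0's time base */
   double to_global(double t_local, double offset)
   {
       return t_local - offset;
   }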



Here is my challenge to the Beowulf and MPI communities: Achieve 
microsecond-level *global* synchronization and tracking of all system 
clocks within a cluster, for application use via portable calls 
gettimeofday() and/or MPI_Wtime().



> Absolutely.  In fact, one of the "fun" things about microbenchmarking is
> that a good microbenchmarking suite and/or a set of good vmstat-like
> tools can really help you understand and "see" the actual costs of stuff
> like this.  Josip Loncaric (on this list) worked very hard some years
> ago to run down a moderately heinous intermittent latency hit in the
> (IIRC) TCP networking stack, for example.  Every now and then instead of
> a predictable relatively short packet latency a really long (again IIRC,
> 2 msec or the like compared to timings on the order of tens of usec) hit
> would appear.

In Linux kernels 2.0.x, this periodic TCP performance hit was 20 ms 
(2000:1), in 2.2.x it was 10 ms (1000:1), but Linux 2.4.x finally seems 
to have fixed this problem without my patches.  While my old web page at 
ICASE went away, the history of this Linux TCP quirk can be found at:

http://jl-icase.home.comcast.net/LinuxTCP.html
http://jl-icase.home.comcast.net/LinuxTCP2.html
http://jl-icase.home.comcast.net/LinuxTCP-patches.html

Sincerely,
Josip

