[Beowulf] ...Re: Benchmark results
josip at lanl.gov
Wed Jan 7 13:41:58 EST 2004
Robert G. Brown wrote:
> A Brief Digression on Benchmark Timing
> (Yes, I lie like a rug. It is not brief. If you already know about
> timing, or don't care, hit 'd' now...:-)
Thank you for providing this excellent article. I'd just like to add a
few details on fine grained timings, which are indeed useful in tracking
down some performance quirks...
> gettimeofday is actually a mediocre timer for nearly all
> microbenchmarking purposes
While gettimeofday() is quite portable as a function call, its timing
accuracy is not. On some machines (e.g. Linux PCs) the times are
reliable to within a few microseconds. On other systems, true
granularity can be 1000 microseconds, apparently related to the
scheduler frequency. This seriously undermines the portability of fine
grained timings done with gettimeofday().
> A much better timer is available on nearly
> all intel/amd cpus.
Indeed -- and thank you for your code examples. One minor simplifying
comment (I learned this recently) is that GNU C has the ability to read
the full 64-bit time stamp counter as follows:
unsigned long long tsc;
asm volatile ("rdtsc": "=A" (tsc));
where the capitalized "=A" tells gcc that the operand constraints of
RDTSC involve both a and d registers, as in:
"A -- Specifies the a or d registers. This is primarily useful for
64-bit integer values (when in 32-bit mode) intended to be returned with
the d register holding the most significant bits and the a register
holding the least significant bits."
Regarding the RDTSC instruction, those interested in more detail can
consult Intel's documentation, but I'd like to add a few observations:
(1) Pentiums execute instructions out of order. Although the RDTSC
instruction reads the 64-bit Time Stamp Counter, which advances at
the CPU's clock frequency (~3GHz these days), the meaning of the count
might not be exactly what you expected. For maximum reliability, one
can issue a serializing instruction such as CPUID before RDTSC, at a
cost of about 300 clock cycles (CPUID can take a variable amount of
time to execute). BTW, just doing the RDTSC and moving its result to
other registers takes about 86 cycles on a Pentium III. Intel's
documentation suggests some compensation techniques, but in many cases
300-cycle resolution (about 100 nanoseconds) is already OK, so CPUID
is optional. Also, older Pentiums (plain or MMX) do not need CPUID.
(2) CPU time stamp counters exist on other processors, e.g. PowerPC and
Alpha. While 64-bit counters won't overflow for centuries, machines
with 32-bit counters (Alpha) experience counter rollovers every few
seconds. This must be accounted for in interpreting the results.
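A single rollover between two reads is handled for free by modular
unsigned arithmetic, as long as the measured interval is shorter than
one full wrap period:

```c
#include <stdint.h>

/* Elapsed ticks on a 32-bit counter, computed modulo 2^32.  Unsigned
   subtraction gives the right answer even if the counter wrapped
   once between the two reads; intervals longer than a full wrap
   period are still ambiguous and must be avoided. */
uint32_t elapsed32(uint32_t start, uint32_t end)
{
    return end - start;
}
```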
(3) Thermal management on recent Pentiums duty-cycles the clock
whenever the processor is in danger of overheating. The clock
frequency does not slow down, but the CPU core sees a normal/stopped
clock with a thermally controlled duty cycle. This raises the
question: is the time stamp counter affected? I hope not, but I do
not know for sure. There are arguments for both approaches, counting
only the cycles seen by the core or counting all clock cycles; I do
not know which way Intel went.
(4) SMP processors present a special challenge in using CPU time stamp
counters for precision timings, because the timing process can migrate
from one CPU to another, where the counter might have a different
offset. Fortunately, Linux checks this on startup, and synchronizes CPU
time stamp counters when booted. Provided that all CPUs count the same
clock cycles, they should remain synchronized, so process migration need
not affect timings. However, this can be a serious problem if the
operating system does not ensure that CPU counters are closely
synchronized, or if the individual CPU counters drift apart for some
reason (e.g. counting a thermally duty-cycled clock, as in (3) above).
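On Linux, one way to sidestep migration entirely is to pin the timing
process to a single CPU with sched_setaffinity() (available in later
2.4 and 2.6 kernels), so that every counter read comes from the same
TSC:

```c
#define _GNU_SOURCE
#include <sched.h>

/* Pin the calling process to one CPU so all time stamp counter
   reads come from the same counter; this avoids offset problems
   when per-CPU counters are not perfectly synchronized. */
int pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set);  /* 0 = self */
}
```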
(5) Precision timings are sometimes available from other sources, e.g.
some high end network cards (Quadrics elan_clock() returns nanoseconds).
Also, MPI specifies portable MPI_Wtime() and MPI_Wtick() calls, but
again, their actual performance varies by MPI implementation and
platform (e.g. some MPI implementations use gettimeofday() to return
MPI_Wtime() in seconds).
(6) All of the above applies to measurements of *local* time only. The
notion of *global* time across a cluster is much fuzzier than it ought
to be, even though it should be technically possible to keep node clocks
synchronized to microseconds across a cluster. NTP helps, but even
better synchronization is needed. One approach to collecting fine
grained global timings is to work with local timers, then correct their
offsets in a post-processing step (this can be tricky, since estimating
individual offsets assumes that there were enough common timed events
during the run).
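The offset-correction idea can be sketched with the standard
NTP-style estimate from one timed round trip: if a message leaves
node A at local time t1, arrives at node B at local time t2, the
reply leaves B at t3 and returns to A at t4, then (assuming the
forward and return paths take equally long):

```c
/* NTP-style clock comparison from a single round trip:
   offset = B's clock minus A's clock, delay = round-trip network
   time.  The estimate is exact only for a symmetric path; path
   asymmetry shows up directly as offset error. */
typedef struct { double offset; double delay; } clock_est;

clock_est estimate_offset(double t1, double t2, double t3, double t4)
{
    clock_est e;
    e.offset = ((t2 - t1) + (t3 - t4)) / 2.0;
    e.delay  = (t4 - t1) - (t3 - t2);
    return e;
}
```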
Here is my challenge to the Beowulf and MPI communities: Achieve
microsecond-level *global* synchronization and tracking of all system
clocks within a cluster, for application use via portable calls
gettimeofday() and/or MPI_Wtime().
> Absolutely. In fact, one of the "fun" things about microbenchmarking is
> that a good microbenchmarking suite and/or a set of good vmstat-like
> tools can really help you understand and "see" the actual costs of stuff
> like this. Josip Loncaric (on this list) worked very hard some years
> ago to run down a moderately heinous intermittent latency hit in the
> (IIRC) TCP networking stack, for example. Every now and then instead of
> a predictable relatively short packet latency a really long (again IIRC,
> 2 msec or the like compared to timings on the order of tens of usec) hit
> would appear.
In Linux kernels 2.0.x, this periodic TCP performance hit was 20 ms
(2000:1), in 2.2.x it was 10 ms (1000:1), but Linux 2.4.x finally seems
to have fixed this problem without my patches. While my old web page at
ICASE went away, history of this Linux TCP quirk can be found at: