[Beowulf] ...Re: Benchmark results
James.P.Lux at jpl.nasa.gov
Wed Jan 7 13:10:45 EST 2004
At 11:18 AM 1/7/2004 -0500, Robert G. Brown wrote:
>A Brief Digression on Benchmark Timing
<snipping a small section of the brief digression>
>Suffice it to say that if you time LOTS of 100 microsecond loops
>separately in a continuous process and look at the results, you're
>likely to see the bimodal distribution Joe refers to (or even a fully
>multimodal distribution) because SOME of those intervals will be
>interrupted and the interrupt/context switch will be "timed" along with
>the swapped out loop by the wall-clock timer. This can be somewhat
>ameliorated by adding a usleep right before each timing loop iteration,
>as the kernel will often take advantage of a task pause to complete
>pending interrupts. However, it alas isn't a magic bullet because not
>all interrupts are "pendable".
That very distribution might be useful as a diagnostic tool, by the way...
>This nanotimer is FAR more suitable for use in microbenchmarking --
>relatively small loops where one devoutly hopes to avoid getting nailed
>by a context switch as one is interested in "how many floating point
>multiplies can the system perform on numbers in its registers/L1
>cache/L2 cache/memory per second". A loop with a few microseconds worth
>of operations, fairly likely not to be interrupted if you start
>immediately after a timer click, can yield < 1% accuracy. Multimodality
>in a distribution of such short loop timings yields "interesting"
>information about the system's CPU, cache, memory architecture, and
>their interaction with the kernel.
This is along the lines of what that nifty benchmark (the name of which
eludes me right now) that produces a graph of speed vs computation size
does.. You can "see" the effects of cache etc as the problems space gets
bigger. (Edward Tufte would be pleased with the results)
>I've wrapped both of the timers up in a single file in a somewhat
>portable form. Because there can be a LOT of clocks in a second
>(several billion, obviously) the nanotimer is a bit complex --
>on the first call it saves the base clock registers as a baseline that
>it subtracts from subsequent calls BEFORE converting the results to a
>double precision nanosecond count. Done in the wrong order, you can end
>up with a significant precision loss subtracting two large numbers to
>get a very small number, hence the runaround. I'm pretty sure it is
>done right at this point. I attach the timer source below for anyone
But, except for diagnostic purposes, code tuning, research into the
performance of CPU architectures, and the like, are such nanoscale
measurements really useful. The real figure of merit is "how long does it
take to run my problem" and that's a wall clock time issue. Given a
suitably scalable problem, is it worth spending much time making the code
10% faster, or are you better off just adding 10% more CPU resources?
> > You wouldn't even need another process taking up time, the kernel will
> > happily do this for you with some of its housekeeping chores (bdflush,
> > kswapd, ...). Servicing lots of interrupts is a good way to slow things
> > down, as are doing many small page allocations where one large one would
> > suffice, etc.
>Absolutely. In fact, one of the "fun" things about microbenchmarking is
>that a good microbenchmarking suite and/or a set of good vmstat-like
>tools can really help you understand and "see" the actual costs of stuff
>like this. Josip Loncaric (on this list) worked very hard some years
>ago to run down a moderately hienous intermittant latency hit in the
>(IIRC) TCP networking stack, for example. Every now and then instead of
>a predictable relatively short packet latency a really long (again IIRC,
>2 msec or the like compared to timings on the order of tens of usec) hit
This sort of nondeterministic latency is the bane of high performance
computing, particularly for those of us with real-time or embedded
applications. In these, you typically need to have an absolute bound on
>Sending lots of small packets is "bad". Doing lots of malloc is "bad".
>Doing any sort of interrupt is bad, in part because the system has a
>tendency to swap in waiting/competing tasks (if any) when the interrupt
>returns and of course because of the relatively high cost of the
>interrupt itself. Lots of system calls can have the same effect --
>select, for example, has about the same cost as gettimeofday, and AFAICT
>nanosleep is really usleep -- the ONLY way I've found to "pause" for a
>nanoscale-specified time is to use a nanotimer-based polling loop.
It might be "bad" in the overall performance sense, but it might be "good"
in the reduction of latency variability standpoint. If you had a very
tightly coupled parallel stream with lots of barrier synchronization, the
overall problem might actually run faster with a longer, but more
consistent, or more fine grained, latencies, rather than something that
usually runs real fast, but occasionally takes a lot longer.
Things like heaps and garbage collection are notorious for this kind of thing.
>FORTUNATELY, most of this isn't horribly difficult to understand or to
>avoid in actual code, and FORTUNATELY most people really are interested
>in macroscopic wall-clock completion times of large scale code, where
>"date" at the shell is accurate enough to get decent timings. Where it
>makes a difference is in optimizing core loops or understanding how you
>spent all this money on a 3 GHz CPU system that IN PRINCIPLE has a 6
>GFLOPS peak but in practice you are lucky to get 1 GFLOPS on a good day
>with a tail wind.
Folks who are squeezing out the last operation per clock cycle typically
use external measurement equipment (i.e. a logic analyzer or oscilloscope),
because you wouldn't want to waste any precious cycles on internally
instrumenting the code. We did a substantial amount of debugging on a
loosely coupled cluster of DSPs just looking at the supply current to the
processors on a scope, since instantaneous power consumption is very
tightly tied to processor utilization. You can see the end of each pass of
butterflies in the FFT (very high current) as you set up the next pass(low
current), and we could see interrupt handling (low current) at inopportune
times causing context switching in the middle of a dense pipelined code
section (because we hadn't masked the interrupt).
>[Parenthetical Aside: Youngsters don't know HOW lucky you are to get
>that 1 GFLOPS in a $1000 system, of course. Suffice it to say that as
>of a few years ago, your desktop would have been classified as a weapon
>for purposes of export, and sending one outside of the US would have
>taken a special license, and a decade ago a GFLOPS system (even a
>cluster) would have cost order of $1 million. Gotta love Moore's Law
>(and the fact that Our Government, helped enormously by the beowulf
>movement, finally wised up to Moore's Law and moved any remaining
>weapons boundaries out so far that they are invisible).]
Those non-exportable video games with Motorola 68000's were "arms" or
"munitions", or even "dual use technologies" , not "weapons"... Such
distinctions are important to the folks at Commerce and State. Besides,
being an "international arms dealer" has a much better sound to it than
"gun runner". Then too, such formerly silly rules helped create news
stories about "bad guys" building immense clusters from networked
Playstation 2s. I suspect nobody's ever successfully done this; at least to
a practical state, running any sort of useful parallel processing. There
was a lot of interest in it for a while, and if you go back through the
archives, you can probably find it.
>James Lux, P.E.
Spacecraft Telecommunications Section
Jet Propulsion Laboratory, Mail Stop 161-213
4800 Oak Grove Drive
Pasadena CA 91109
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
More information about the Beowulf