[Beowulf] ...Re: Benchmark results
Robert G. Brown
rgb at phy.duke.edu
Wed Jan 7 19:46:46 EST 2004
On Wed, 7 Jan 2004, Jim Lux wrote:
> This is along the lines of what that nifty benchmark (the name of which
> eludes me right now) that produces a graph of speed vs computation size
> does... You can "see" the effects of cache etc. as the problem space gets
> bigger. (Edward Tufte would be pleased with the results)
Well, that's one of the things cpu_rate does (the package from which
these timers were lifted).  It runs one of a variety of stream-like tests
but uses malloc instead of a hard-coded data allocation, so that one can
vary the vector size for the vector tests.  In fact, all the stream tests
are embedded in cpu_rate, even though they are more correctly viewable as
memory tests, or maybe integrated-architecture (CPU and memory together)
tests.  cpu_rate also contains the old savage benchmark and a stream
extension that includes division (division is typically MUCH SLOWER than
multiplication and addition, making it "unsexy" to include in a
benchmark designed to show how fast your system is, although I find that
my own code has plenty of divisions in it:-).
It doesn't provide a graph per se -- it provides a bogomegarate.
However, there is a script included that will usually generate a file of
rates at a range of vector sizes that can then easily be graphed. All
available on brahma or my personal website.
lmbench is also a very good microbenchmark, by the way -- arguably
better than cpu_rate. Larry has been in the benchmark business longer
and Carl Staelin also is very smart. For a while, though, Larry had
some "odd" rules on the benchmark (designed not to close the source but
rather to keep vendors from corrupting its application to their
systems). I just prefer a straight GPL, so I developed a VERY old
"FLOPS" benchmark suite of mine (dating back to the late 80's or early
90's -- I think I originally wrote it using DeSmet C if any old PC users
recognize it:-) into a microbenchmark I could use and publish without restriction.
> But, except for diagnostic purposes, code tuning, research into the
> performance of CPU architectures, and the like, are such nanoscale
> measurements really useful. The real figure of merit is "how long does it
> take to run my problem" and that's a wall clock time issue. Given a
> suitably scalable problem, is it worth spending much time making the code
> 10% faster, or are you better off just adding 10% more CPU resources?
Agreed. Although ATLAS is a pretty clear demonstration that sometimes
it's a matter of 200-300% faster. Algorithm and tuning CAN matter a
lot, and it is worth learning "enough" to know whether that is likely
to be the case for your code.  Just something like inverting a value so
you can multiply by it many times inside a core loop instead of dividing
by it many times can make a BIG difference in core loop performance.
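The invert-then-multiply trick looks like this in a hypothetical scaling
loop (the function names and the dt parameter are made up for
illustration):

```c
/* Two versions of a hypothetical array-scaling loop, illustrating the
 * reciprocal trick described above.  fdiv typically costs several times
 * as many cycles as fmul, so hoisting the single division out of the
 * loop can speed up the whole pass. */
#include <stddef.h>

/* slow: one division per element */
void scale_div(double *x, size_t n, double dt)
{
    for (size_t i = 0; i < n; i++)
        x[i] = x[i] / dt;
}

/* fast: one division total, then only multiplies inside the loop */
void scale_mul(double *x, size_t n, double dt)
{
    double inv_dt = 1.0 / dt;   /* hoisted out of the core loop */
    for (size_t i = 0; i < n; i++)
        x[i] = x[i] * inv_dt;
}
```

One caveat: because 1.0/dt is rounded once, the two versions can differ
in the last bit, which is why compilers only do this for you under
flags like -ffast-math.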
> Folks who are squeezing out the last operation per clock cycle typically
> use external measurement equipment (i.e. a logic analyzer or oscilloscope),
> because you wouldn't want to waste any precious cycles on internally
> instrumenting the code. We did a substantial amount of debugging on a
> loosely coupled cluster of DSPs just looking at the supply current to the
> processors on a scope, since instantaneous power consumption is very
> tightly tied to processor utilization. You can see the end of each pass of
> butterflies in the FFT (very high current) as you set up the next pass(low
> current), and we could see interrupt handling (low current) at inopportune
> times causing context switching in the middle of a dense pipelined code
> section (because we hadn't masked the interrupt).
Now THAT'S serious performance optimization.
> Those non-exportable video games with Motorola 68000's were "arms" or
> "munitions", or even "dual use technologies" , not "weapons"... Such
> distinctions are important to the folks at Commerce and State. Besides,
> being an "international arms dealer" has a much better sound to it than
> "gun runner". Then too, such formerly silly rules helped create news
> stories about "bad guys" building immense clusters from networked
> Playstation 2s. I suspect nobody's ever successfully done this; at least to
> a practical state, running any sort of useful parallel processing. There
> was a lot of interest in it for a while, and if you go back through the
> archives, you can probably find it.
They were just worried that bad persons/countries would use the systems
to simulate critical nuclear cores so that they could build
small/efficient devices without having to actually test them. Correctly
worried -- I imagine that my laptop would work just fine to simulate at
least some aspects of a nuclear chain reaction, given the right code.
I personally never worried much about this. The nuclear bomb is 1940's
technology. Give me access to U235 or plutonium and I'll build one.
Moderately efficient non-implosion uranium bombs are trivial to build --
literally garage stuff. Implosion/plutonium bombs are more difficult --
explosive lenses and so forth -- but c'mon, this is 2004. Saddam had
functional test pits at the end of the FIRST gulf war -- he just didn't
have the nuclear grade material (from what was reported at the time).
Conventional explosives are pretty well known by now.
Robert G. Brown http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525  email: rgb at phy.duke.edu
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf