[Beowulf] Re: Benchmark results

Robert G. Brown rgb at phy.duke.edu
Wed Jan 7 19:46:46 EST 2004

On Wed, 7 Jan 2004, Jim Lux wrote:

> This is along the lines of what that nifty benchmark (the name of which 
> eludes me right now) that produces a graph of speed vs computation size 
> does.. You can "see" the effects of cache etc as the problems space gets 
> bigger. (Edward Tufte would be pleased with the results)

Well, that's one of the things cpu_rate does (the package from which
these timers were lifted).  It runs one of a variety of stream-like
tests but uses malloc instead of a hard data allocation so that one can
vary the vector size for vector tests.  In fact, all the stream tests
are embedded in cpu_rate, even though they are more correctly viewed as
memory tests, or maybe integrated architecture (CPU and memory together)
tests.  cpu_rate also contains the old savage benchmark and a stream
extension that includes division (division is typically MUCH SLOWER than
multiplication and addition, making it "unsexy" to include in a
benchmark designed to show how fast your system is, although I find that
my own code has plenty of divisions in it:-).

It doesn't provide a graph per se -- it provides a bogomegarate.
However, a script is included that will generate a file of rates at a
range of vector sizes that can then easily be graphed.  All of this is
available on brahma or my personal website.

lmbench is also a very good microbenchmark, by the way -- arguably
better than cpu_rate.  Larry has been in the benchmark business longer
and Carl Staelin also is very smart.  For a while, though, Larry had
some "odd" rules on the benchmark (designed not to close the source but
rather to keep vendors from corrupting its application to their
systems).  I just prefer a straight GPL, so I developed a VERY old
"FLOPS" benchmark suite of mine (dating back to the late 80's or early
90's -- I think I originally wrote it using DeSmet C if any old PC users
recognize it:-) into a microbenchmark I could use and publish without
restriction.

> But, except for diagnostic purposes, code tuning, research into the 
> performance of CPU architectures, and the like,  are such nanoscale 
> measurements really useful.  The real figure of merit is "how long does it 
> take to run my problem" and that's a wall clock time issue.  Given a 
> suitably scalable problem, is it worth spending much time making the code 
> 10% faster, or are you better off just adding 10% more CPU resources?

Agreed.  Although ATLAS is a pretty clear demonstration that sometimes
it's a matter of 200-300% faster.  Algorithm and tuning CAN matter a
lot, and it is worth learning "enough" to know whether that is likely
to be the case for your code.  Just something like inverting a value so
you can multiply it many times inside a core loop instead of dividing by
it many times inside a core loop can make a BIG difference in core loop
performance.

> Folks who are squeezing out the last operation per clock cycle typically 
> use external measurement equipment (i.e. a logic analyzer or oscilloscope), 
> because you wouldn't want to waste any precious cycles on internally 
> instrumenting the code.  We did a substantial amount of debugging on a 
> loosely coupled cluster of DSPs just looking at the supply current to the 
> processors on a scope, since instantaneous power consumption is very 
> tightly tied to processor utilization.  You can see the end of each pass of 
> butterflies in the FFT (very high current) as you set up the next pass(low 
> current), and we could see interrupt handling (low current) at inopportune 
> times causing context switching in the middle of a dense pipelined code 
> section (because we hadn't masked the interrupt).

Now THAT'S serious performance optimization.  

> Those non-exportable video games with Motorola 68000's were "arms" or 
> "munitions", or even "dual use technologies" , not "weapons"... Such 
> distinctions are important to the folks at Commerce and State.  Besides, 
> being an "international arms dealer" has a much better sound to it than 
> "gun runner".  Then too, such formerly silly rules helped create news 
> stories about "bad guys" building immense clusters from networked 
> Playstation 2s. I suspect nobody's ever successfully done this; at least to 
> a practical state, running any sort of useful parallel processing. There 
> was a lot of interest in it for a while, and if you go back through the 
> archives, you can probably find it.

They were just worried that bad persons/countries would use the systems
to simulate critical nuclear cores so that they could build
small/efficient devices without having to actually test them. Correctly
worried -- I imagine that my laptop would work just fine to simulate at
least some aspects of a nuclear chain reaction, given the right code.

I personally never worried much about this.  The nuclear bomb is 1940's
technology.  Give me access to U235 or plutonium and I'll build one.
Moderately efficient non-implosion uranium bombs are trivial to build --
literally garage stuff.  Implosion/plutonium bombs are more difficult --
explosive lenses and so forth -- but c'mon, this is 2004.  Saddam had
functional test pits at the end of the FIRST gulf war -- he just didn't
have the nuclear grade material (from what was reported at the time).
Conventional explosives are pretty well known by now.


Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu
