New cluster benchmark proposal (Re: top500 list)
Robert G. Brown
rgb at phy.duke.edu
Tue Nov 18 09:21:23 EST 2003
On Tue, 18 Nov 2003, Jakob Oestergaard wrote:
> On Mon, Nov 17, 2003 at 01:36:28PM -0800, Bill Broadley wrote:
> > After all this discussion of the top 500 list, it got me thinking about a
> > "better" benchmark. Where "better" means more useful to evaluating my
> > idea of cluster goodness.
> There are lies, damn lies, and statistics...
> Your points about a more appropriate benchmark are valid - but we must
> realize that there is no such thing as "the one true benchmark".
> Some clusters are tailored for one specific workload - one app. that has
> been written for the cluster, as the cluster was built for the app. In
> those situations, you can run that app on the cluster and get your "true
> performance" metric.
I agree and disagree. I personally have a deep and abiding mistrust of
high end benchmarks -- benchmarks of complex code -- unless they are MY
complex code. Things like linpack and spec are useful only to the
extent that one or more components "resembles" your application. Screw
resemblance -- test your application.
However, I think Bill's points are very well taken, so much so that I
saved the article in my "List Ideas" directory for eventual
reconsideration and mention in an article or the book.
I also think that MICROBENCHMARKS are very useful indeed to systems and
cluster engineers. Things like lmbench or stream or netpipes are small
(generally nearly trivial code) and relatively insensitive to
compiler/architecture quirks, or at least, if they are sensitive, they
are likely to be sensitive in ways that do translate to arbitrary
applications that use the tested operations.
They are also a LOT harder to "fool", especially if the microbenchmarks
can be run by anybody from a GPL source base. The vendor cannot easily
fudge a benchmark if you put your benchmark source on a vanilla Linux
install, compile it, and run it. Or again, if they do "fudge" somehow
under those circumstances (perhaps by warping an entire architecture to
optimize some result:-) it is likely that a real application will
benefit from the optimized operation, even if other operations elsewhere
suffer.
The latter sort of tradeoff is why Larry McVoy insists that lmbench
(which can be run, of course, any way a user likes, a microbenchmark at
a time) can only be used to publish >>results<< if a full suite of
results is published, not "selected" ones on which a vendor does well.
This is intended to prevent the kind of abuse that early benchmarks were
notorious for attracting (and that likely continues today). Chip real
estate ALSO goes through various opportunity cost decisioning processes
(re: previous post on grant processes:-) and a new LU to optimize
process X comes at the expense of e.g. on-chip context storage, more
registers, or lower heat production and hence higher clock. At some point
you are robbing Peter to pay Paul, and the issue becomes one of balance.
The balance issue extends out to the rest of the architecture, as has
increasingly been a list focus. CPU clock has consistently outpaced
memory (in Moore's Law exponent); both have WAY outpaced the network.
Disk has outpaced even the CPU in volume, but lagged even the network in
So I personally would like to see a full suite of microbenchmarks --
literally trivial components wrapped in a timing harness. These should
measure core functions that are building blocks of real programs. Many
of these computational component measurements exist for standalone
systems; not so many for clustered systems. I think this is the
intriguing element of Bill's suggestions. A benchmark graph of just how
long it takes to use raw UDP or TCP sockets, MPI, PVM to pass a message
according to one of several patterns, plotted as a 2d/3d function of
e.g. message size and number of nodes, together with stream results
(and perhaps some of the other cpu_rate or lmbench benchmarks, depending
on your arithmetic mix) would be a lot more openly informative than what
gets published now.
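A minimal sketch of the message-passing measurement described above, assuming a single machine: it times round trips over a local socket pair as a function of message size. This is only a stand-in for a real two-node run over TCP, UDP, MPI or PVM, and it does not cover the number-of-nodes axis at all. The names recv_exact, pingpong and sweep are invented for this illustration.

```python
import socket
import threading
import time

def recv_exact(sock, nbytes):
    """Read exactly nbytes from a stream socket."""
    chunks = []
    remaining = nbytes
    while remaining > 0:
        chunk = sock.recv(min(remaining, 65536))
        if not chunk:
            raise ConnectionError("peer closed the connection early")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)

def echo_server(sock, nbytes, rounds):
    """Echo back 'rounds' messages of nbytes each; stop quietly if the peer goes away."""
    try:
        for _ in range(rounds):
            sock.sendall(recv_exact(sock, nbytes))
    except (ConnectionError, OSError):
        pass

def pingpong(nbytes, rounds=20):
    """Return the mean round-trip time in seconds for nbytes-sized messages."""
    a, b = socket.socketpair()
    server = threading.Thread(target=echo_server, args=(b, nbytes, rounds))
    server.start()
    payload = b"x" * nbytes
    try:
        start = time.perf_counter()
        for _ in range(rounds):
            a.sendall(payload)
            recv_exact(a, nbytes)
        elapsed = time.perf_counter() - start
    finally:
        a.close()      # closing first lets the echo thread exit if we failed early
        server.join()
        b.close()
    return elapsed / rounds

def sweep(sizes, rounds=20):
    """Return a list of (message size, mean round-trip seconds) pairs."""
    return [(n, pingpong(n, rounds)) for n in sizes]

if __name__ == "__main__":
    for size, rtt in sweep([1, 64, 1024, 16384, 262144]):
        print(f"{size:>8d} bytes  {rtt * 1e6:10.1f} us round trip")
```

Plotting the output of sweep against message size gives the kind of latency/bandwidth curve the paragraph has in mind; small messages expose latency, large ones expose bandwidth.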
For one thing, it would separate out a lot of the bullshit associated
with "top 500-ness". We could look at two clusters and compare their
actual performance in important metrics at a glance, instead of
wondering who could possibly give a rodent's furry behind about tools
that de facto are just ONE possible measure of aggregate CPU in ONE set
of fairly complex operations out of a practical infinity that might
actually occur in our code.
> For most of the top machines, I'd be rather surprised if there hadn't
> been a pretty clear idea about what the machines would be running, prior
> to purchase.
;-) I think you're right...
> I think that having one poor (but well known and simple) metric is the
> better solution.
It does make it simple, but it doesn't make it better. It's the old
issue -- "how many MFLOPS -> GFLOPS -> TFLOPS is your cluster?" (arrows
indicate the progress of roughly decades). Whose di..um, I mean
"cluster" is bigger?
First, tell me what the HELL a MFLOP is. My microbenchmark measurements
of a MFLOP don't agree with any of the accepted definitions, and vary
significantly with whether or not e.g. division is included in the
"floating point operations" tested. Since division is so slow, it is
almost always omitted from computations of FLOPS. Since division is so
common, people wonder why even their simple loops with division in them
don't ever achieve the blazing throughput they expected. Then there are
the rather immense variations in performance observed as e.g. the size
of vectors is varied, code is driven from local/sequential to
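A back-of-the-envelope sketch of why including division changes a "MFLOPS" number so much. The per-operation cycle costs below are hypothetical round numbers chosen only for illustration, not measurements of any real CPU, and the model ignores pipelining and overlap; effective_mflops is a made-up helper name.

```python
def effective_mflops(clock_mhz, op_mix, cycles_per_op):
    """Model MFLOPS for a loop body.

    op_mix:        operation name -> count per loop iteration
    cycles_per_op: operation name -> assumed cycles per operation (no overlap)
    """
    total_ops = sum(op_mix.values())
    total_cycles = sum(count * cycles_per_op[op] for op, count in op_mix.items())
    # clock_mhz is millions of cycles per second, so this is millions of flops per second.
    return clock_mhz * total_ops / total_cycles

# HYPOTHETICAL cycle costs: division assumed 20x slower than add/multiply.
CYCLES = {"add": 1, "mul": 1, "div": 20}

no_div = effective_mflops(1000, {"add": 2, "mul": 2}, CYCLES)
with_div = effective_mflops(1000, {"add": 2, "mul": 1, "div": 1}, CYCLES)

if __name__ == "__main__":
    print(f"no division:   {no_div:7.1f} MFLOPS")    # 1000.0 under these assumptions
    print(f"one division:  {with_div:7.1f} MFLOPS")  # about 173.9 under these assumptions
```

Under these assumed costs, swapping one multiply for one divide in a four-operation loop body cuts the modeled rate by more than a factor of five, which is the kind of gap the paragraph is complaining about.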
Cluster engineers are not stupid. Well, maybe SOME of them are stupid,
somewhere, but I haven't met any that happened to be drooling and
looking off in the distance with a vacant expression. Unless a beer
happened to be sitting in front of them, of course. I think that they
could manage to learn to use a very complex (but well documented, GPL)
instrument set to support intelligent cluster design. Hell, I think
most of the good people on this list use a complex but NOT terribly
integrated set to support intelligent cluster design now! As I said,
stream, netpipes, even spec (there ARE people whose tasks match
decently with at least one component). And of course, the best of
benchmarks, your application, but >>even optimizing your application<<
requires knowledge only a microbenchmark can provide.
The benefits of using this sort of information intelligently can equal
the output of your entire cluster put together. Dongarra's ATLAS
project is a shining beacon for what can be done in this regard.
Factors of 2-3 speedup are not unknown for what CAN be core operations
in many computations, just automagically adjusting algorithm and stride
to take maximal advantage of register/L1/L2/memory latencies and
bandwidths and the underlying CPU/chipset. It is pretty much the ONLY
way one can achieve superlinear speedup -- know where significant
nonlinearities in bottleneck speed occur and partition the task
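The empirical-tuning idea can be sketched as: time the same operation for several candidate block sizes on the machine at hand and keep the fastest. The sketch below is not ATLAS's actual method or API; it uses a chunked buffer copy as a stand-in kernel, and chunked_copy and tune_block_size are names invented for this illustration.

```python
import time

def chunked_copy(src, dst, block):
    """Copy src into dst one block at a time (src and dst must be the same length)."""
    for start in range(0, len(src), block):
        dst[start:start + block] = src[start:start + block]

def tune_block_size(nbytes, candidates, repeats=3):
    """Time chunked_copy for each candidate block size.

    Returns (best_block, timings), where timings maps block size -> best seconds.
    """
    src = bytearray(nbytes)
    dst = bytearray(nbytes)
    timings = {}
    for block in candidates:
        best = float("inf")
        for _ in range(repeats):
            t0 = time.perf_counter()
            chunked_copy(src, dst, block)
            best = min(best, time.perf_counter() - t0)
        timings[block] = best
    best_block = min(timings, key=timings.get)
    return best_block, timings

if __name__ == "__main__":
    best, timings = tune_block_size(1 << 22, [64, 1024, 16384, 262144])
    for block, seconds in timings.items():
        print(f"block {block:>7d}: {seconds * 1e3:8.2f} ms")
    print("fastest block size on this machine:", best)
```

In Python the differences are dominated by per-slice interpreter overhead rather than cache behavior, so this shows only the shape of the search; a real tuner would run compiled kernels and search over more parameters than one block size.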
Robert G. Brown http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu