Cluster benchmark summary

Bill Broadley bill at
Wed Nov 26 04:00:08 EST 2003

Greetings all.  Many thanks for the many responses.

My main frustration and motivation for my benchmark proposal was the
relatively poor relationship between advertised link latencies and
bandwidths and actual application level scaling/efficiency.

Turns out this was a popular topic at SC2003.  I discussed it with many
people and attended a few discussions on it.

McCalpin gave a talk discussing (from memory, so expect inaccuracies)
how MFlops predict SPEC cpu_rate (very poorly) and how memory bandwidth
predicts cpu_rate (less poorly).  He then discussed a hybrid model using
MFlops, cache size, and memory bandwidth.  Something along the lines of
0.8 bytes per flop with zero cache and 0.1 bytes per flop with 8MB of
cache was used in a model to predict SPEC cpu_rate from MFlops and
memory bandwidth.  This fairly simple model (MFlops * cache efficiency *
bandwidth) led to a pretty good correlation with the 900+ SPEC cpu_rate
numbers he collected (my vague memory wants to claim +/- 10 or 20%).
I interpret this as somewhat of a validation of microbenchmarks acting
as a predictor of real application performance (for applications that
are well understood).  If I find the slides online I'll post them (if
someone else does, please follow up).  At least he has a convincing graph
(I know, a great way to lie) of his predictions for 923 SPEC cpu_rate
results.
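
To make the idea concrete, here is a rough sketch of that kind of model.
It is NOT McCalpin's actual formula (I don't have the slides).  The only
numbers taken from the talk as I remember it are the 0.8 and 0.1 bytes per
flop endpoints and the 8MB cache size.  The linear interpolation between
them, the assumption that compute time and memory time simply add, and
the function names are all my own guesses.

```python
def bytes_per_flop(cache_mb, no_cache=0.8, full_cache=0.1, full_cache_mb=8.0):
    """Memory traffic per flop for a given cache size.

    The endpoints (0.8 at 0 MB, 0.1 at 8 MB) are the figures recalled from
    the talk; interpolating linearly between them is an assumption.
    """
    frac = min(max(cache_mb / full_cache_mb, 0.0), 1.0)
    return no_cache + frac * (full_cache - no_cache)


def predicted_mflops(peak_mflops, bandwidth_mb_s, cache_mb):
    """Sustained MFlops if compute time and memory time simply add.

    Assumed model: time per flop = 1/peak + bytes_per_flop/bandwidth,
    with peak in MFlops and bandwidth in MB/s so the units cancel.
    """
    time_per_flop = 1.0 / peak_mflops + bytes_per_flop(cache_mb) / bandwidth_mb_s
    return 1.0 / time_per_flop


if __name__ == "__main__":
    # Hypothetical machine, not a measured system.
    for cache in (0.0, 2.0, 8.0):
        print(cache, "MB cache ->", round(predicted_mflops(1000.0, 1000.0, cache)), "MFlops")
```

With these assumptions a hypothetical node with 1000 peak MFlops, 1000 MB/s
of memory bandwidth and no cache comes out at 1000/1.8, or about 556
sustained MFlops.  More cache lowers the bytes-per-flop term and pushes
the prediction back toward peak.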

The most noteworthy benchmark suite mentioned at SC2003 was:

Basically 5 benchmarks (well, 4 + the Top500 HP Linpack) to help
quantify cluster performance and scaling: McCalpin's STREAM, Random
Access (I believe I heard this referred to as something that sounded
like GUPS), PTRANS (parallel matrix transpose), and b_eff (effective
bandwidth benchmark).
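
For anyone who hasn't seen a random-access style test, here is a toy
single-node sketch of the idea: XOR pseudo-random values into
pseudo-random slots of a table and count updates per second.  This is
only an illustration, not the suite's code.  The function name, the
default table size, the choice of 4 updates per table entry, and the use
of Python's random module are all my assumptions, and interpreter
overhead means the rate it reports is not comparable to real results.

```python
import random
import time


def random_access(table_bits=16, updates=None, seed=1):
    """Run XOR updates at random table slots; return (table, updates/sec)."""
    size = 1 << table_bits              # power of two so masking picks a slot
    if updates is None:
        updates = 4 * size              # assumed ratio, chosen for illustration
    table = list(range(size))
    rng = random.Random(seed)           # seeded so runs are repeatable

    start = time.perf_counter()
    for _ in range(updates):
        value = rng.getrandbits(64)
        table[value & (size - 1)] ^= value
    elapsed = time.perf_counter() - start

    # Guard against a zero interval on very short runs.
    return table, updates / max(elapsed, 1e-9)


if __name__ == "__main__":
    _, rate = random_access()
    print("updates per second: %.0f" % rate)
```

The point of the access pattern is that almost every update misses cache,
so the result tracks memory (or interconnect) latency rather than peak
flops or streaming bandwidth.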

Current version is at 0.4 alpha, so here is your chance people, improve it
while you can.  I'm assuming that input is welcome, and patches doubly so.

I think this is a great start.

Currently submitted results are for a Cray (vector), Alphaserver,
Itanium2, Altix, and Power 4 based clusters.  I'd love to see additional
numbers for Myrinet, Dolphin, Quadrics and Infiniband clusters.
Submit yours today!

Oh, and most importantly (no Spec mistakes here), source is available,
so have at it and report results (click on archive or upload).  I have
no idea what the license status of the source is; it is available for
download but doesn't mention any licensing terms.  Ideally it will be
GPL or similar.

I believe source code optimization is legal AFTER reporting baseline
results from the unmodified source.  I also believe that ALL results must
be posted, mainly to avoid cherry-picking.  Of course the URL mentioned
is the authoritative source for such info.

I heard rumors from several different people that was going
to collect these performance numbers, but still rank only on HPL.  Of
course people can download the results and rank however they want.

So hopefully this will lead to interconnect companies competing on
complete cluster performance instead of link speeds and latencies.  


I'll list here any other benchmarks people brought to my attention.
Please follow up if I missed anything.  Many of the messages came
in at the conference after my ethernet and wireless died (damn
Dell laptop), and of course upon my return I've been swamped with email.

Felix Rauch mentioned the Switchmark discussed in a paper at:

A collection of benchmarks, mentioned at SC2003, is available at:

John Hearns mentioned: (beowulf performance suite)
More related discussion at:

Bill Broadley
Computational Science and Engineering
UC Davis