Opteron vs. Itanium 2

Robert G. Brown rgb at phy.duke.edu
Fri Oct 31 11:02:29 EST 2003

On Thu, 30 Oct 2003, Richard Walsh wrote:

> >cache is a problem unless your code is actually a spec component,
> >or unless all machines have the same basic cache-to-working-set relation
> >for each component.  alternative: run each component on a sweep of problem
> >sizes, and derive two scores: in-cache and out-cache.  use both scores 
> >as part of the overall summary statistic.
>  Very good as well.  This is the "cpu-rate-comes-to-spec" approach
>  that I am sure Bob Brown would endorse.

Oh, sure.  "I endorse this." ;-)

As you guys are working this out fine on your own, I like it combined with
Mark's suggestion of showing the entire constellation for spec (which of
course you CAN access and SHOULD access in any case instead of relying
on geometric or any other mean measure of performance:-).

I really think that the primary weakness of many HPC performance
benchmarks is that they DON'T sweep problem size and present the results
as a graph, and that they DON'T present a full suite of different results
measuring many identifiably different components of overall performance.
From way
back with early linpack, this has left many benchmarks susceptible to
vendor manipulation -- there are cases on record of vendors (DEC, IIRC,
but likely others) actually altering CPU/memory architecture to optimize
linpack performance because linpack was what sold their systems.  This
isn't just my feeling, BTW -- Larry McVoy has similar concerns (more
stridently expressed) in his lmbench suite -- he actually made it (and
likely still makes it) a condition of applying the suite to a system
that the tests can NEVER be run selectively, with just one (favorable:-)
number quoted in a publication or advertisement -- the results of the
complete suite have to be presented all together, with your abysmal
failures side by side with your successes.

I personally am less religious about NEVER doing anything and dislike
semi-closed sources and "rules" even for benchmarks (it makes far more
sense to caveat emptor and pretty much ignore vendor-based performance
claims in general:-), but do think that you get a hell of a lot more
information from a graph of e.g. stream results as a function of vector
size than you get from just "running stream".  Since running stream as a
function of vector size more or less requires using malloc to allocate
the memory and hence adds one additional step of indirection to memory
address resolution, it also very slightly worsens the results, but very
likely in the proper direction -- towards the real world, where people
do NOT generally recompile an application in order to change problem
size.

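A sketch of what that sweep looks like (illustrative Python rather than
the C of stream itself; the triad kernel and the 3-streams-of-8-bytes
bandwidth accounting follow stream's conventions, but none of this is
stream's actual code, and interpreter overhead makes the absolute
numbers meaningless -- it's the shape of the curve vs. vector size that
matters):

```python
import time
from array import array

def triad(a, b, c, scalar):
    """The stream 'triad' kernel: a[i] = b[i] + scalar*c[i]."""
    for i in range(len(a)):
        a[i] = b[i] + scalar * c[i]

def triad_mbs(n, scalar=3.0, reps=3):
    """Best-of-reps triad rate in MB/s over heap-allocated
    (malloc'd, in the C version) vectors of n doubles."""
    a = array("d", bytes(8 * n))      # n zero-filled doubles
    b = array("d", (float(i) for i in range(n)))
    c = array("d", bytes(8 * n))
    best = float("inf")
    for _ in range(reps):
        t0 = time.perf_counter()
        triad(a, b, c, scalar)
        best = min(best, time.perf_counter() - t0)
    return 3 * 8 * n / best / 1e6     # three 8-byte streams per pass

if __name__ == "__main__":
    # sweep vector size from comfortably in-cache to well out of it
    for n in (1 << k for k in range(10, 21, 2)):
        print(n, round(triad_mbs(n), 1))
```

Plot rate against n and the cache boundaries show up as knees in the
curve -- which is exactly the information a single-number result hides.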
I also really like Mark's idea of having a benchmark database site where
comparative results from a wide range of benchmarks can be easily
searched, collated, and cross-referenced.  Like the spec site,
actually.  However, that's something that takes a volunteer or
organization with spare resources, much energy, and an attitude to make
happen, and since one would like to e.g. display spec results on a
non-spec site and since spec is (or was, I don't keep up with its
"rules") fairly tightly constrained on who can run it and how/where its
results can be posted, it might not be possible to create your own spec
db, your own lmbench db, your own linpack db, all on a public site.
cpu_rate you can do whatever you want with -- it is full GPL code so a
vendor could even rewrite it as long as they clearly note that they
have done so and post the rewritten sources.  Obviously you should
either get results from somebody you trust or run it yourself, but that
is true for any benchmark, with the latter being vastly preferable. :-)

If I ever have a vague bit of life in me again and can return to
cpu_rate, I'm in the middle of yet another full rewrite that should make
it much easier to create and encapsulate a new code fragment to
benchmark AND should permit running an "antistream" version of all the
tests involving long vectors (one where all the memory addresses are
accessed in a random/shuffled order, to deliberately defeat the cache).
However, I'm stretched pretty thin at the moment -- a talk to give
Tuesday on xmlsysd/wulfstat, a CW column due on Wednesday, and I've
agreed to write an article on yum due on Sunday of next week I think
(and need to finish the yum HOWTO somewhere in there as well).  So it
won't be anytime soon...:-)
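
The "antistream" idea is easy to state in code, though: walk the same
vector through an explicit index vector, in order for the stream case
and shuffled for the antistream case.  (Again an illustrative Python
sketch, not cpu_rate -- in Python the interpreter overhead swamps the
cache effect, so treat this as a statement of the method, not a
measurement:)

```python
import random
import time

def indexed_sum(v, order):
    """Walk v through an explicit index vector, so exactly the same
    code performs both the in-order and the shuffled traversal."""
    total = 0.0
    for i in order:
        total += v[i]
    return total

if __name__ == "__main__":
    n = 1 << 18
    v = [float(i) for i in range(n)]
    seq = list(range(n))          # stream: ascending addresses
    shuf = seq[:]
    random.shuffle(shuf)          # antistream: same addresses, shuffled

    for name, order in (("stream", seq), ("antistream", shuf)):
        t0 = time.perf_counter()
        indexed_sum(v, order)
        print(name, round((time.perf_counter() - t0) * 1e3, 2), "ms")
```

In a compiled version the shuffled walk defeats both the hardware
prefetcher and cache-line reuse, so the gap between the two times is a
direct look at the memory subsystem without the cache flattering it.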

> >I'd love to see good data-mining tools for spec results.  for instance,
> >I'd like to have an easy way to compare consecutive results for the same 
> >machine as the vendor changed the compiler, or as clock increases.
>  ... or increased cache size.  Another winning suggestion.
> >there's a characteristic "shape" to spec results - which scores are 
> >high and low relative to the other scores for a single machine.  not only
> >does this include outliers (drastic cache or compiler effects), but
> >points at strengths/weaknesses of particular architectures.  how to do this,
> >perhaps some kind of factor analysis?
>  This is what I refer to as the Spec fingerprint or Rorschach
>  test. We need a neural net derived analysis and classification here. 

<chortle>.  The only one I'd trust is the one already implemented in
wetware.  After all, classification according to what? 

>  Another presentation that I like is the "star graph" in which major 
>  characteristics (floating point perf., integer perf., cache, memory
>  bandwidth, etc.) are laid out in equal degrees as vectors around
>  a circle. Each processor is measured on each axis to give a star
>  print and the total area is a measure of "total goodness".
>  I hope someone from Spec is reading this ... and they remember who
>  made these suggestions ... ;-).
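
For what it's worth, that "total goodness" area has a closed form: with
k axes at equal angles and normalized per-axis scores r_1..r_k, the star
polygon is k triangles of area (1/2) r_i r_{i+1} sin(2*pi/k) each.  A
sketch (the axis names and scores below are made up):

```python
import math

def star_area(scores):
    """Area of the 'star print': per-axis scores laid out as radii at
    equal angles around a circle, adjacent tips joined (indices wrap)."""
    k = len(scores)
    wedge = 0.5 * math.sin(2 * math.pi / k)   # triangle area per unit r_i*r_{i+1}
    return wedge * sum(scores[i] * scores[(i + 1) % k] for i in range(k))

# hypothetical normalized scores: fp, int, cache, memory bandwidth
print(star_area([0.9, 0.7, 0.5, 0.8]))
```

Note that the area depends on how you ORDER the axes around the circle,
which is one more reason to treat any single "total goodness" number
with suspicion.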

But things are more complicated than this.  The real problem with SPEC
is that your application may well resemble one of the components of the
suite, in which case that component is a decent predictor of performance
for your application almost by definition.  However, the mean
performance on the suite may or may not be well correlated with that
component, or your application may not resemble ANY of the components on
the suite.  Then there are variations with compiler, operating system,
memory configuration, scaling (or lack thereof!) with CPU clock.  As
Mark says, TBBIYOC (the best benchmark is your own code) is the only
safe rule if you seek to compare systems
on the basis of "benchmarks".

I personally tend to view large application benchmarks like linpack and
spec with a jaded eye and prefer lmbench and my own microbenchmarks to
learn something about the DETAILED performance of my architecture on
very specific tasks that might be components of a large application,
supplemented with YOC.  Or rather MOC.

Zen question: Which one reflects the performance of an architecture, a
BLAS-based benchmark or an ATLAS-tuned BLAS-based benchmark?


Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu

Beowulf mailing list, Beowulf at beowulf.org