opteron VS Itanium 2
rbw at ahpcrc.org
Thu Oct 30 16:32:38 EST 2003
Mark Hahn wrote:
>> this fact leads us back to the idea that cache >>is<< important for a suite
>> of "representative codes".
>yes, certainly, and TBBIYOC (*). but the traditional perhaps slightly
>stodgy attitude towards this has been that caches do not help machine
>balance. that is, it2 has a peak/theoretical 4 flops/cycle, but since
>that would require, worst case, 3 doubles per flop, the highest-ranked
>CPU is actually imbalanced by a factor of 22.5!
>(*) the best benchmark is your own code
Agreed, but since the scope of the discussion seemed to be microprocessors
which are all relatively bad on balance compared to vector ISA/designs,
I did not elaborate on balance. This is a design area that favors the
Opteron (and Power 4) because the memory controller is on-chip (unlike
the Pentium 4 and I2) and, as such, its performance improves with clock.
I think it is interesting to look at other processors' theoretical balance
numbers in relation to the I2's that you compute (I hope I have
them all correct):
Pentium 4 EE 3.2 GHz:
(3.2 GHz * 2 flops/cycle * 24 bytes/flop) / 6.4 GB/sec = Balance of 24
(max on chip cache 2MB)
Itanium 2 1.5 GHz:
(1.5 GHz * 4 flops/cycle * 24 bytes/flop) / 6.4 GB/sec = Balance of 22.5
(max on chip cache 6MB)
Opteron 246 2.0 GHz:
(2.0 GHz * 2 flops/cycle * 24 bytes/flop) / 6.4 GB/sec = Balance of 15
(max on chip cache 1MB)
Power 4 1.7 GHz:
(1.7 GHz * 4 flops/cycle * 24 bytes/flop) / 6.4 GB/sec = Balance of 25.5*
(max on chip cache 1.44MB)
Cray X1 0.8 GHz:
(0.8 GHz * 4 flops/cycle * 24 bytes/flop) / 19.2 GB/sec = Balance of 4
(512 byte off-chip L2)
* IBM memory performance is with 1 core disabled and may now be higher
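For anyone who wants to check or extend the list above, here is a minimal sketch of the same arithmetic. It assumes the bandwidth figures are in GB/sec and uses the worst case of 3 doubles (24 bytes) per flop; the names BYTES_PER_FLOP, CPUS and balance are mine, chosen only for illustration.

```python
# Balance = bytes/sec demanded at peak flop rate / peak memory bandwidth.
# Assumption: worst case of 3 doubles (24 bytes) moved per flop.
BYTES_PER_FLOP = 24

# (clock in GHz, peak flops per cycle, memory bandwidth in GB/sec),
# copied from the figures quoted in this post.
CPUS = {
    "Pentium 4 EE 3.2 GHz": (3.2, 2, 6.4),
    "Itanium 2 1.5 GHz": (1.5, 4, 6.4),
    "Opteron 246 2.0 GHz": (2.0, 2, 6.4),
    "Power 4 1.7 GHz": (1.7, 4, 6.4),
    "Cray X1 0.8 GHz": (0.8, 4, 19.2),
}


def balance(clock_ghz, flops_per_cycle, bandwidth_gb_s):
    """Return the theoretical balance (lower is better)."""
    return clock_ghz * flops_per_cycle * BYTES_PER_FLOP / bandwidth_gb_s


for name, spec in CPUS.items():
    print(f"{name}: balance of {balance(*spec):.1f}")
```

A lower number means less memory bandwidth is missing per peak flop, so plugging in a different clock or bandwidth shows quickly how a part moves relative to the others.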
When viewed in context, yes, the I2 is poorly balanced, but it is typical
of microprocessors, and it is not the worst among them. It also offers the
largest compensating cache. Where it loses a lot of ground is in the dual
processor configuration. Opteron yields a better number, but this is
because it can't do as many flops. The Cray X1 has the most aggressive
design specs and yields a large enough percentage of peak to beat the
fast-clocked micros on vector code (leaving the ugly question of price aside).
This is in part due to the more balanced design, but also due to its vector
ISA which is just better at moving data from memory.
>let's step back a bit. suppose we were designing a new version of SPEC,
>and wanted to avoid every problem that the current benchmarks have.
>here are some partially unworkable ideas:
>keep geometric mean, but also quote a few other metrics that don't
>hide as much interesting detail. for instance, show the variance of
>scores. or perhaps show base/peak/trimmed (where the lowest and highest
>component are simply dropped).
Definitely. I am constantly trimming the reported numbers myself and
looking at the bar graphs for an eyeball variance. It takes willpower
to avoid being seduced by a single summarizing number. The
Ultra III's SpecFP number was a good reminder.
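As a rough sketch of what reporting more than one number could look like, the snippet below computes the geometric mean, a trimmed geometric mean (lowest and highest component dropped) and the sample variance for one set of component ratios. The scores in `example` are made up for illustration and are not real SPEC results; the function names are mine.

```python
import math
import statistics


def geometric_mean(scores):
    """Geometric mean of positive scores, computed via logs."""
    return math.exp(sum(math.log(s) for s in scores) / len(scores))


def summarize(scores):
    """Return several summary statistics instead of a single number.

    Needs at least three scores so that the trimmed set is not empty.
    """
    ordered = sorted(scores)
    trimmed = ordered[1:-1]  # drop the lowest and the highest component
    return {
        "geomean": geometric_mean(ordered),
        "trimmed_geomean": geometric_mean(trimmed),
        "variance": statistics.variance(ordered),
        "min": ordered[0],
        "max": ordered[-1],
    }


# Hypothetical component ratios with one large outlier.
example = [400, 500, 450, 550, 5000]
for key, value in summarize(example).items():
    print(f"{key}: {value:.1f}")
```

With a single large outlier like this, the trimmed figure lands well below the plain geometric mean, which is exactly the detail a lone summary number hides.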
>cache is a problem unless your code is actually a spec component,
>or unless all machines have the same basic cache-to-working-set relation
>for each component. alternative: run each component on a sweep of problem
>sizes, and derive two scores: in-cache and out-cache. use both scores
>as part of the overall summary statistic.
Very good as well. This is the "cpu-rate-comes-to-spec" approach
that I am sure Bob Brown would endorse.
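A minimal sketch of the two-score idea, assuming each component has been run over a sweep of working-set sizes and that we know the cache size of the machine under test. The function name cache_scores, the (size, rate) layout and the sample numbers are all hypothetical.

```python
import math


def geometric_mean(values):
    """Geometric mean of positive values, computed via logs."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def cache_scores(sweep, cache_bytes):
    """Split a problem-size sweep into in-cache and out-of-cache scores.

    sweep: list of (working_set_bytes, rate) pairs for one component.
    Returns (in_cache_score, out_cache_score); a score is None when no
    sweep point falls on that side of the cache size.
    """
    in_cache = [rate for size, rate in sweep if size <= cache_bytes]
    out_cache = [rate for size, rate in sweep if size > cache_bytes]
    return (
        geometric_mean(in_cache) if in_cache else None,
        geometric_mean(out_cache) if out_cache else None,
    )


# Hypothetical sweep: rate drops once the working set exceeds a 1 MB cache.
sweep = [(256 * 1024, 900.0), (512 * 1024, 880.0),
         (4 * 1024 * 1024, 300.0), (16 * 1024 * 1024, 280.0)]
print(cache_scores(sweep, cache_bytes=1024 * 1024))
```

A hard cutoff at the cache size is the simplest possible split; a real suite would probably want to exclude points near the boundary, where the rate is in transition.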
>I'd love to see good data-mining tools for spec results. for instance,
>I'd like to have an easy way to compare consecutive results for the same
>machine as the vendor changed the compiler, or as clock increases.
... or increased cache size. Another winning suggestion.
>there's a characteristic "shape" to spec results - which scores are
>high and low relative to the other scores for a single machine. not only
>does this include outliers (drastic cache or compiler effects), but
>points at strengths/weaknesses of particular architectures. how to do this,
>perhaps some kind of factor analysis?
This is what I refer to as the Spec fingerprint or Rorschach
test. We need a neural net derived analysis and classification here.
Another presentation that I like is the "star graph" in which major
characteristics (floating point perf., integer perf., cache, memory
bandwidth, etc.) are laid out at equal angles as vectors around
a circle. Each processor is measured on each axis to give a star
print and the total area is a measure of "total goodness".
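The "total area" of such a star graph is easy to compute. A small sketch, assuming each characteristic has already been normalized to a comparable scale (say 0 to 1) and that the axes are spaced at equal angles; the function name star_area and the sample values are mine.

```python
import math


def star_area(radii):
    """Area of the polygon joining the points plotted on n equally spaced axes.

    radii: one non-negative, already normalized score per axis, in axis order.
    Each pair of neighbouring axes spans a triangle of area
    0.5 * r_i * r_next * sin(2*pi/n); the star area is the sum of these.
    """
    n = len(radii)
    if n < 3:
        raise ValueError("a star graph needs at least three axes")
    wedge = math.sin(2 * math.pi / n)
    return 0.5 * wedge * sum(radii[i] * radii[(i + 1) % n] for i in range(n))


# Hypothetical normalized scores: FP, integer, cache, memory bandwidth.
print(star_area([0.9, 0.7, 0.8, 0.3]))
```

One caveat with this measure: the area depends on the order of the axes, since only neighbouring scores are multiplied together, so the ordering would have to be fixed for comparisons between processors to mean anything.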
I hope someone from Spec is reading this ... and they remember who
made these suggestions ... ;-).
# Richard Walsh
# Project Manager, Cluster Computing, Computational
# Chemistry and Finance
# netASPx, Inc.
# 1200 Washington Ave. So.
# Minneapolis, MN 55415
# VOX: 612-337-3467
# FAX: 612-337-3400
# EMAIL: rbw at networkcs.com, richard.walsh at netaspx.com
# rbw at ahpcrc.org
# Nullum magnum ingenium sine mixtura dementiae fuit.
# - Seneca