Opteron vs. Itanium 2
Richard Walsh
rbw at ahpcrc.org
Thu Oct 30 16:32:38 EST 2003
Mark Hahn wrote:
>> this fact leads us back to the idea that cache >>is<< important for a suite
>> of "representative codes".
>
>yes, certainly, and TBBIYOC (*). but the traditional perhaps slightly
>stodgy attitude towards this has been that caches do not help machine
>balance. that is, it2 has a peak/theoretical 4flops/cycle, but since
>that would require, worstcase, 3 doubles per flop, the highest-ranked
>CPU is actually imbalanced by a factor of 22.5!
>
>(*) the best benchmark is your own code
Agreed, but since the scope of the discussion seemed to be microprocessors,
which are all relatively poorly balanced compared to vector ISAs/designs,
I did not elaborate on balance. This is a design area that favors the
Opteron (and Power 4) because the memory controller is on-chip (unlike
the Pentium 4 and I2), and as such its performance improves with clock.
I think it is interesting to look at other processors' theoretical balance
numbers in relation to the I2's that you compute (I hope I have
them all correct):
Pentium 4 EE 3.2 GHz:
(3.2 GHz * 2 flops/cycle * 24 bytes/flop) / 6.4 GB/s = balance of 24
(max on-chip cache 2MB)
Itanium 2 1.5 GHz:
(1.5 GHz * 4 flops/cycle * 24 bytes/flop) / 6.4 GB/s = balance of 22.5
(max on-chip cache 6MB)
Opteron 246 2.0 GHz:
(2.0 GHz * 2 flops/cycle * 24 bytes/flop) / 6.4 GB/s = balance of 15
(max on-chip cache 1MB)
Power 4 1.7 GHz:
(1.7 GHz * 4 flops/cycle * 24 bytes/flop) / 6.4 GB/s = balance of 25.5*
(max on-chip cache 1.44MB)
Cray X1 0.8 GHz:
(0.8 GHz * 4 flops/cycle * 24 bytes/flop) / 19.2 GB/s = balance of 4
(512 byte off-chip L2)
* IBM memory performance is with 1 core disabled and may now be higher
than this.
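The arithmetic above can be sketched in a few lines of Python. The 24
bytes/flop figure is the worst-case traffic Mark mentions (three 8-byte
doubles per flop); the processor tuples below just restate the numbers
from the list above, so this is a restatement of the table, not new data:

```python
# Balance = (bandwidth needed to sustain peak flops) / (bandwidth available).
# Higher numbers mean the memory system is more undersized relative to peak.

BYTES_PER_FLOP = 24  # worst case: 3 doubles (e.g. 2 loads + 1 store) per flop

def balance(clock_ghz, flops_per_cycle, mem_bw_gbs):
    """Ratio of bandwidth demanded at peak flops to bandwidth actually available."""
    peak_gflops = clock_ghz * flops_per_cycle
    needed_gbs = peak_gflops * BYTES_PER_FLOP
    return needed_gbs / mem_bw_gbs

# (clock GHz, flops/cycle, memory bandwidth GB/s), as in the list above
processors = {
    "Pentium 4 EE": (3.2, 2, 6.4),
    "Itanium 2":    (1.5, 4, 6.4),
    "Opteron 246":  (2.0, 2, 6.4),
    "Power 4":      (1.7, 4, 6.4),
    "Cray X1":      (0.8, 4, 19.2),
}

for name, (ghz, fpc, bw) in processors.items():
    print(f"{name:14s} balance = {balance(ghz, fpc, bw):5.1f}")
```

Running it reproduces the balance column: 24, 22.5, 15, 25.5, and 4.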
When viewed in context, yes, the I2 is poorly balanced, but it is typical
of microprocessors, and it is not the worst among them. It also offers the
largest compensating cache. Where it loses a lot of ground is in the dual-
processor configuration. The Opteron yields a better number, but only
because it can't do as many flops. The Cray X1 has the most aggressive
design specs and yields a large enough percentage of peak to beat the
fast-clocked micros on vector code (leaving the ugly question of price aside).
This is in part due to its more balanced design, but also to its vector
ISA, which is just better at moving data from memory.
>let's step back a bit. suppose we were designing a new version of SPEC,
>and wanted to avoid every problem that the current benchmarks have.
>here are some partially unworkable ideas:
>
>keep geometric mean, but also quote a few other metrics that don't
>hide as much interesting detail. for instance, show the variance of
>scores. or perhaps show base/peak/trimmed (where the lowest and highest
>component are simply dropped).
Definitely. I am constantly trimming the reported numbers myself and
eyeballing the bar graphs for variance. It takes willpower
to avoid being seduced by a single summarizing number. The
Ultra III's SpecFP number was a good reminder.
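The extra metrics Mark proposes are easy to compute; here is a minimal
sketch of geometric mean, variance, and a trimmed geometric mean (lowest
and highest components dropped). The score list is made up for
illustration, not real SPEC data:

```python
import math
import statistics

def geomean(xs):
    """Geometric mean, as SPEC uses for its summary number."""
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

def trimmed_geomean(xs):
    """Geometric mean with the single lowest and highest components dropped."""
    s = sorted(xs)
    return geomean(s[1:-1])

# Hypothetical per-component scores for one machine, with one outlier (120.4)
# of the kind a big cache or a compiler trick can produce:
scores = [41.0, 38.5, 55.2, 12.3, 44.7, 39.9, 120.4, 43.1]

print(f"geometric mean : {geomean(scores):.1f}")
print(f"trimmed gmean  : {trimmed_geomean(scores):.1f}")
print(f"variance       : {statistics.variance(scores):.1f}")
```

Quoting the trimmed mean and variance alongside the headline number would
flag exactly the outlier-driven results that the single summary hides.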
>cache is a problem unless your code is actually a spec component,
>or unless all machines have the same basic cache-to-working-set relation
>for each component. alternative: run each component on a sweep of problem
>sizes, and derive two scores: in-cache and out-cache. use both scores
>as part of the overall summary statistic.
Very good as well. This is the "cpu-rate-comes-to-spec" approach
that I am sure Bob Brown would endorse.
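The sweep idea can be sketched as a STREAM-style triad timed over a range
of working-set sizes; in a compiled language the rate drops off a cliff
once the arrays spill out of cache, and the in-cache and out-of-cache
plateaus become the two scores. Pure Python's interpreter overhead swamps
the memory effects, so treat this as a structural sketch of the sweep,
not a real measurement:

```python
import array
import time

def triad_rate_mbs(n, reps=3):
    """Best-of-reps MB/s for the triad a[i] = b[i] + s*c[i] over n doubles."""
    b = array.array("d", [1.0] * n)
    c = array.array("d", [2.0] * n)
    a = array.array("d", [0.0] * n)
    s = 3.0
    best = float("inf")
    for _ in range(reps):
        t0 = time.perf_counter()
        for i in range(n):
            a[i] = b[i] + s * c[i]
        best = min(best, time.perf_counter() - t0)
    return (3 * 8 * n) / best / 1e6  # 3 arrays of 8-byte doubles touched

# Sweep working sets from well inside cache to well outside it:
for n in (1 << k for k in range(10, 19, 2)):
    print(f"{n * 24 / 1024:8.0f} KiB  {triad_rate_mbs(n):10.1f} MB/s")
```

Splitting each component's sweep into an in-cache score and an out-of-cache
score would stop a big cache from silently inflating the single summary.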
>I'd love to see good data-mining tools for spec results. for instance,
>I'd like to have an easy way to compare consecutive results for the same
>machine as the vendor changed the compiler, or as clock increases.
... or increased cache size. Another winning suggestion.
>there's a characteristic "shape" to spec results - which scores are
>high and low relative to the other scores for a single machine. not only
>does this include outliers (drastic cache or compiler effects), but
>points at strengths/weaknesses of particular architectures. how to do this,
>perhaps some kind of factor analysis?
This is what I refer to as the Spec fingerprint or Rorschach
test. We need a neural-net-derived analysis and classification here.
Another presentation that I like is the "star graph" in which major
characteristics (floating point perf., integer perf., cache, memory
bandwidth, etc.) are laid out at equal angles as vectors around
a circle. Each processor is measured on each axis to give a star
print and the total area is a measure of "total goodness".
I hope someone from Spec is reading this ... and they remember who
made these suggestions ... ;-).
Regards,
rbw
#---------------------------------------------------
# Richard Walsh
# Project Manager, Cluster Computing, Computational
# Chemistry and Finance
# netASPx, Inc.
# 1200 Washington Ave. So.
# Minneapolis, MN 55415
# VOX: 612-337-3467
# FAX: 612-337-3400
# EMAIL: rbw at networkcs.com, richard.walsh at netaspx.com
# rbw at ahpcrc.org
#
#---------------------------------------------------
# Nullum magnum ingenium sine mixtura dementiae fuit.
# - Seneca
#---------------------------------------------------
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf