Athlon memory speed asymmetry

Robert G. Brown rgb at phy.duke.edu
Tue Feb 25 17:02:51 EST 2003


On Tue, 25 Feb 2003, David Mathog wrote:

> I'm observing an odd asymmetry in memory utilization on
> Athlons.  Trivial array operations (read then write)
> are up to 30% faster going up through the array than
> down through the array.
...
> Here's a tiny example program (97 lines, mostly comments):
...
> Anybody else seen this before?
> Who knows what causes it?
> Is there a way around it on these Athlons?

Very interesting.  No, I hadn't noticed this one, although I have seen
some real oddities in arithmetic rates before on both Intel and AMD.

To get more precise timing measurements of the differential and to be
able to study them as a function of vector size and stride, I embedded
this in the memtest portion of cpu_rate.  The cpu_rate memtest just does
a simple read/write of a memory location in an unsigned int vector whose
length can be specified on the command line, at a stride that can also
be specified on the command line, wrapped up in a timing harness that
subtracts off the time required for the loop itself and any embedding
overhead (I hope).  It thus lets you time the microscopic operation at a
fairly high precision in different cache contexts, which may help us
figure this out.
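
In case the mechanics of the harness aren't clear, here is a minimal
sketch of the subtract-the-empty-loop idea.  This is NOT the actual
cpu_rate source -- the structure and names are purely illustrative, and
it uses clock_gettime() where cpu_rate uses its own cycle-counter
nanotimer -- but it shows what gets timed and what gets subtracted:

 /* harness sketch: time the full loop, time the "empty" loop, report
  * the per-access difference.  Compile with e.g. gcc -O harness.c
  * (add -lrt on older glibc for clock_gettime).                      */
 #include <stdio.h>
 #include <stdlib.h>
 #include <time.h>

 static double now_nsec(void)
 {
   struct timespec ts;
   clock_gettime(CLOCK_MONOTONIC, &ts);
   return ts.tv_sec*1.0e9 + ts.tv_nsec;
 }

 int main(int argc, char *argv[])
 {
   int size    = (argc > 1) ? atoi(argv[1]) : 1000; /* vector length (ints) */
   int stride  = (argc > 2) ? atoi(argv[2]) : 1;    /* access stride        */
   int samples = 100, iters = 1024;
   int nops    = (size + stride - 1)/stride;        /* accesses per pass    */
   unsigned int *ai = malloc(size*sizeof(*ai));
   volatile unsigned int aitmp;                     /* keeps the optimizer honest */
   double t, t_full = 0.0, t_empty = 0.0;
   int s, n, i, aindex;

   if (ai == NULL) return 1;
   for (i = 0; i < size; i++) ai[i] = i;            /* sequential index fill */

   /* full loop: index load plus the read/write being measured */
   for (s = 0; s < samples; s++) {
     t = now_nsec();
     for (n = 0; n < iters; n++)
       for (i = 0; i < size; i += stride) {
         aindex = ai[i];
         aitmp  = ai[aindex];
         ai[aindex] = aitmp;
       }
     t_full += now_nsec() - t;
   }

   /* "empty" loop: the same embedding overhead, minus the read/write */
   for (s = 0; s < samples; s++) {
     t = now_nsec();
     for (n = 0; n < iters; n++)
       for (i = 0; i < size; i += stride)
         aitmp = ai[i];                             /* just the index load */
     t_empty += now_nsec() - t;
   }

   printf("avg nsec per read/write pair = %f\n",
          (t_full - t_empty)/samples/iters/nops);
   free(ai);
   return 0;
 }

Everything quoted below is essentially that per-access difference,
averaged over the samples.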

The cpu_rate tool (with the reverse sequential memtest) is available on
the brahma site:

  http://www.phy.duke.edu/brahma

or my personal website, http://www.phy.duke.edu/~rgb under the Beowulf
tab.

Some results from cpu_rate are given below, on my 1333 MHz Athlon (with
PC2100 DDR).  In these runs, -t selects the test (7 is forward
sequential, 8 is reverse sequential, 9 is random access) and -s sets the
vector size in ints.

   rgb

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


#========================================================================
# Timing "Empty" Loop
# Samples = 100  Loop iterations per sample = 4096
# Time:  2291.8379 +/-     0.8568 nsec
#========================================================================
# Timing test 7
# Samples = 100  Loop iterations per sample = 1024
# Time:  6247.6912 +/-    60.0939 nsec 
#========================================================================
# Sequential Integer Memory (read/write) Access:
# size = 1000  stride = 1  vector length = 4000:
#   aitmp = ai[aindex]
#   ai[aindex] = aitmp
#   where aindex = ai[i] = i initially.
# NANOTIMER granularity (nsec/cycle) =  0.750
# avg_time_full = 3.123846 avg_time_empty = 1.145919 
# Average Time:   1.98 nanoseconds
# BogomegaRate: 505.58 megaseqmem int read/writes per second

rgb at ganesh|T:901>cpu_rate -t 8
#========================================================================
# Timing "Empty" Loop
# Samples = 100  Loop iterations per sample = 4096
# Time:  2293.3385 +/-     0.8943 nsec
#========================================================================
# Timing test 8
# Samples = 100  Loop iterations per sample = 1024
# Time:  6250.7871 +/-    59.6240 nsec 
#========================================================================
# Sequential (backwards) Integer Memory (read/write) Access:
# size = 1000  stride = 1  vector length = 4000:
#   aitmp = ai[aindex]
#   ai[aindex] = aitmp
#   where aindex = ai[i] = i initially.
# NANOTIMER granularity (nsec/cycle) =  0.750
# avg_time_full = 3.125394 avg_time_empty = 1.146669 
# Average Time:   1.98 nanoseconds
# BogomegaRate: 505.38 megabackseqmem int read/writes per second

rgb at ganesh|T:906>cpu_rate -t 9
#========================================================================
# Timing "Empty" Loop
# Samples = 100  Loop iterations per sample = 4096
# Time:  2291.3671 +/-     0.7767 nsec
#========================================================================
# Timing test 9
# Samples = 100  Loop iterations per sample = 1024
# Time:  6227.6784 +/-     7.6977 nsec 
#========================================================================
# Random Integer Memory (read/write) Access:
# size = 1000  stride = 1  vector length = 4000:
#   aitmp = ai[aindex]
#   ai[aindex] = aitmp
#   where aindex = ai[i] = shuffled index.
# NANOTIMER granularity (nsec/cycle) =  0.750
# avg_time_full = 3.113839 avg_time_empty = 1.145684 
# Average Time:   1.97 nanoseconds
# BogomegaRate: 508.09 megarandmem int read/writes per second

From this we see that if the vector fits entirely into L1 cache, forward
access, backward access, and random access take about the same amount of
time.  Note that the code being timed in the test is:

 for (i=0; i<size; i+=stride){
   aindex = ai[i];       /* index load (also present in the "empty" loop) */
   aitmp = ai[aindex];   /* the read being timed */
   ai[aindex] = aitmp;   /* the write being timed */
 }

less the time required for (separately evaluated):

 for (i=0; i<size; i+=stride){
   aindex = ai[i];
 }

where the only difference between sequential, reverse sequential and
random is how the ai[] vector is filled with its own indices.  There
will therefore be a small penalty when the vector's two ends won't both
fit in cache at the same time.
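
For concreteness, the three fills look roughly like this (illustrative
only, not necessarily the exact cpu_rate code; in particular the
rand()-based Fisher-Yates shuffle is just one way of producing the
"shuffled index" of test 9):

 /* load ai[] with its own indices; the timed loop is identical in all
  * three tests, only this fill changes.                              */
 #include <stdlib.h>

 void fill_indices(unsigned int *ai, int size, int test)
 {
   int i, j;
   unsigned int tmp;

   for (i = 0; i < size; i++) ai[i] = i;         /* test 7: ai[i] = i      */

   if (test == 8)                                /* test 8: walk downwards */
     for (i = 0; i < size; i++) ai[i] = size - 1 - i;

   if (test == 9)                                /* test 9: shuffle the    */
     for (i = size - 1; i > 0; i--) {            /* indices (Fisher-Yates  */
       j = rand() % (i + 1);                     /* with rand() here)      */
       tmp = ai[i]; ai[i] = ai[j]; ai[j] = tmp;
     }
 }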

As we increase the size of the vector to roughly half of L2 (and much
bigger than L1):

rgb at ganesh|T:910>cpu_rate -t 7 -s 32000
#========================================================================
# Timing "Empty" Loop
# Samples = 100  Loop iterations per sample = 128
# Time: 77094.8508 +/-    91.4151 nsec
#========================================================================
# Timing test 7
# Samples = 100  Loop iterations per sample = 32
# Time: 231774.8517 +/-   166.2378 nsec 
#========================================================================
# Sequential Integer Memory (read/write) Access:
# size = 32000  stride = 1  vector length = 128000:
#   aitmp = ai[aindex]
#   ai[aindex] = aitmp
#   where aindex = ai[i] = i initially.
# NANOTIMER granularity (nsec/cycle) =  0.750
# avg_time_full = 3.621482 avg_time_empty = 1.204607 
# Average Time:   2.42 nanoseconds
# BogomegaRate: 413.76 megaseqmem int read/writes per second

rgb at ganesh|T:909>cpu_rate -t 8 -s 32000
#========================================================================
# Timing "Empty" Loop
# Samples = 100  Loop iterations per sample = 128
# Time: 77434.2657 +/-   517.2918 nsec
#========================================================================
# Timing test 8
# Samples = 100  Loop iterations per sample = 32
# Time: 220100.5173 +/-   134.2393 nsec 
#========================================================================
# Sequential (backwards) Integer Memory (read/write) Access:
# size = 32000  stride = 1  vector length = 128000:
#   aitmp = ai[aindex]
#   ai[aindex] = aitmp
#   where aindex = ai[i] = i initially.
# NANOTIMER granularity (nsec/cycle) =  0.750
# avg_time_full = 3.439071 avg_time_empty = 1.209910 
# Average Time:   2.23 nanoseconds
# BogomegaRate: 448.60 megabackseqmem int read/writes per second

rgb at ganesh|T:918>cpu_rate -t 9 -s 32000
#========================================================================
# Timing "Empty" Loop
# Samples = 100  Loop iterations per sample = 128
# Time: 76797.9778 +/-    39.3887 nsec
#========================================================================
# Timing test 9
# Samples = 100  Loop iterations per sample = 32
# Time: 400156.5388 +/-   168.9157 nsec 
#========================================================================
# Random Integer Memory (read/write) Access:
# size = 32000  stride = 1  vector length = 128000:
#   aitmp = ai[aindex]
#   ai[aindex] = aitmp
#   where aindex = ai[i] = shuffled index.
# NANOTIMER granularity (nsec/cycle) =  0.750
# avg_time_full = 6.252446 avg_time_empty = 1.199968 
# Average Time:   5.05 nanoseconds
# BogomegaRate: 197.92 megarandmem int read/writes per second

These numbers are pretty reproducible, and suggest that as long as the
vector is entirely in L2 it makes little difference whether it is
accessed forwards or backwards, and backwards actually seems to have a
tiny advantage, at least for this code.  Random access times, on the
other hand, are stretching out.  This also suggests that cache is
"working" pretty well -- L1 is hiding L2 memory access times for either
sequential direction of access, but random access patterns defeat about
half its ability to do so (assuming L1 times of 1 ns for read or write,
L2 times of about 9-10 ns).
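
(As a rough check on that "about half": the all-L1 runs above come in at
about 2 ns per read/write pair, a completely defeated L1 should cost
something like the full 9-10 ns L2 latency per pair, and the measured
5.05 ns for random access falls just about midway between the two.)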

When we stretch the vector out to much larger than L2:

rgb at ganesh|T:919>cpu_rate -t 7 -s 1000000
#========================================================================
# Timing "Empty" Loop
# Samples = 100  Loop iterations per sample = 2
# Time: 11713794.0611 +/-  3828.4599 nsec
#========================================================================
# Timing test 7
# Samples = 100  Loop iterations per sample = 2
# Time: 21220962.0875 +/- 32389.0768 nsec 
#========================================================================
# Sequential Integer Memory (read/write) Access:
# size = 1000000  stride = 1  vector length = 4000000:
#   aitmp = ai[aindex]
#   ai[aindex] = aitmp
#   where aindex = ai[i] = i initially.
# NANOTIMER granularity (nsec/cycle) =  0.750
# avg_time_full = 10.610481 avg_time_empty = 5.856897 
# Average Time:   4.75 nanoseconds
# BogomegaRate: 210.37 megaseqmem int read/writes per second

rgb at ganesh|T:920>cpu_rate -t 8 -s 1000000
#========================================================================
# Timing "Empty" Loop
# Samples = 100  Loop iterations per sample = 2
# Time: 11752219.3785 +/- 31999.0185 nsec
#========================================================================
# Timing test 8
# Samples = 100  Loop iterations per sample = 2
# Time: 30729268.6483 +/- 32411.0175 nsec 
#========================================================================
# Sequential (backwards) Integer Memory (read/write) Access:
# size = 1000000  stride = 1  vector length = 4000000:
#   aitmp = ai[aindex]
#   ai[aindex] = aitmp
#   where aindex = ai[i] = i initially.
# NANOTIMER granularity (nsec/cycle) =  0.750
# avg_time_full = 15.364634 avg_time_empty = 5.876110 
# Average Time:   9.49 nanoseconds
# BogomegaRate: 105.39 megabackseqmem int read/writes per second

rgb at ganesh|T:921>cpu_rate -t 9 -s 1000000
#========================================================================
# Timing "Empty" Loop
# Samples = 100  Loop iterations per sample = 2
# Time: 11714174.2590 +/-  4242.4690 nsec
#========================================================================
# Timing test 9
# Samples = 100  Loop iterations per sample = 2
# Time: 238072616.9525 +/- 92417.8420 nsec 
#========================================================================
# Random Integer Memory (read/write) Access:
# size = 1000000  stride = 1  vector length = 4000000:
#   aitmp = ai[aindex]
#   ai[aindex] = aitmp
#   where aindex = ai[i] = shuffled index.
# NANOTIMER granularity (nsec/cycle) =  0.750
# avg_time_full = 119.036308 avg_time_empty = 5.857087 
# Average Time: 113.18 nanoseconds
# BogomegaRate:   8.84 megarandmem int read/writes per second
80.960user 0.013sys 99.8%, 0ib 0ob 0tx 0da 0to 0swp 1:21.10

These are very interesting numbers indeed.  Sequential latencies, at
4-5 nanoseconds, are still about half what I think the L2 latency
itself is, suggesting that memory prefetch is doing remarkably well at
BOTH moving the data from memory into L2 AND moving the data from L2
into L1 before it is actually needed.  Pretty remarkable, really -- at
4.75 ns versus 1.98 ns straight out of L1, the system is only about
2.5x slower than when the whole vector fit in L1.

Reverse sequential latencies (where some of the cache's efficiency is
admittedly degraded due to the particular order of access of ai[]
elements in the loops above) seem to be very nearly what I expect L2
latency to be -- around 9-10 nanoseconds.  Perhaps a better way of
splitting things out would push it as low as 8 nanoseconds, but this is
anywhere from 60-100% slower than sequential access.  It is still quite
fast.

There are a number of possible interpretations of these numbers.  One
might be that the memory prefetch is >>still<< working pretty well from
memory into L2, but that the data isn't making it through into L1 as
effectively as it does when the entire vector is in L2.

Alternatively, the number could just be an average over increases in
the reverse-order prefetch time from memory into L2 that happens to
come out close to the L2 access time.  In this case, it could be DDR
SDRAM itself that is partly at fault.  I seem to recall that it has a
fairly hefty latency penalty for the first byte, and significantly
reduced penalties for subsequent SEQUENTIAL bytes.  It may be that
reverse order access prevents it from realizing this advantage.  It
would be very interesting to rewrite the streaming tests so that they
proceed one byte at a time (up or down), two bytes, four bytes, eight
bytes, with various fairly deliberate starting offsets.  Could the
reverse slowdown be associated with e.g. word boundaries?  This would
require some additional tests (and maybe a new parameter for the
offset) to be added to cpu_rate, but it might be worthwhile to give it
a shot.
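
Something along these lines is what I have in mind -- purely a
hypothetical sketch, with names (sweep, width, offset, dir) and
structure of my own invention rather than anything currently in
cpu_rate:

 /* walk a buffer up (dir = +1) or down (dir = -1) in elements of 1, 2,
  * 4 or 8 bytes, starting at a deliberate byte offset                 */
 #include <stddef.h>
 #include <stdlib.h>
 #include <string.h>

 static void sweep(unsigned char *buf, size_t bytes,
                   size_t width, size_t offset, int dir)
 {
   size_t n = (bytes - offset)/width;            /* elements touched    */
   unsigned char *p = (dir > 0) ? buf + offset
                                : buf + offset + (n - 1)*width;
   unsigned char tmp[8];
   size_t i;

   for (i = 0; i < n; i++) {
     memcpy(tmp, p, width);                      /* read one element    */
     memcpy(p, tmp, width);                      /* write it back       */
     p += (ptrdiff_t)dir*(ptrdiff_t)width;       /* step up or down     */
   }
 }

 int main(void)
 {
   size_t bytes = 4000000;                       /* well outside L2     */
   unsigned char *buf = malloc(bytes);
   size_t width, offset;
   int dir;

   if (buf == NULL) return 1;
   memset(buf, 0, bytes);
   for (width = 1; width <= 8; width *= 2)
     for (offset = 0; offset < 4; offset++)
       for (dir = 1; dir >= -1; dir -= 2)
         sweep(buf, bytes, width, offset, dir);  /* time each combination
                                                    in the usual harness */
   free(buf);
   return 0;
 }

Timing each (width, offset, direction) combination in the usual
empty-loop-subtraction harness would show directly whether the reverse
penalty tracks element width, alignment, or neither.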

I don't know if this helps at all, but it made for an interesting
afternoon of playing and forced me to clean up cpu_rate (again).  Maybe
even enough that it won't segfault the first time somebody tries
it :-).



_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf


