[Beowulf] Re: dual core (latency)

Stuart Midgley sdm900 at gmail.com
Tue Jul 19 01:22:09 EDT 2005

I like your email style :)

a) Reading doesn't prevent snooping; it causes it.  You need to snoop
all the caches to make sure the cache line isn't on some other cpu
before you go to main memory.

b) nothing is free - cache snooping costs a lot (even more advanced  
methods like page caches - see SGI Altix systems - cost a lot)

c) Cores being idle has absolutely nothing to do with cache snooping
(unless you have to flush from a higher level cache or register).  A
cpu doesn't know a priori that another cpu doesn't have a process on it
or that it isn't holding an old cache line.

d) I would expect dual cores to have a larger latency... as per my  
previous argument

e) I guess this is an interesting point.

Actually, you would be surprised how MUCH bandwidth and latency have
to do with each other in computers.  They are VERY tightly coupled.
For example...  you have a cpu with dual channel DDR400 (PC3200)
memory attached.  So you think your bandwidth is 6.4GB/s... then why
does the STREAM benchmark show a maximum of around 3-4GB/s?  Where did
the other ~2.5GB/s go?

Now, if you look at the actual bandwidth of loading a single cache line:

A cache line is 128 bytes, which can be accessed at 6.4GB/s, so it
takes 128/6.4/1024/1024/1024 s = 18.6ns to transfer.

Take into account the ~125ns latency and you get the 128 byte cache
line in about 143ns, which gives a bandwidth of about 0.83GB/s (128
bytes / 143.6ns, in the same 1024^3 units as above).
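
The arithmetic above can be checked with a minimal Python sketch.  It
assumes the same figures as in the text (128 byte line, 6.4GB/s peak,
~125ns latency) and the same convention of dividing by 1024^3; the
function names are made up for illustration.  The result comes out a
little under 1GB/s.

```python
# Effective bandwidth of one isolated cache line fetch.
# Assumption: "GB" means 2**30 bytes, matching the 1024/1024/1024
# division used for the 18.6ns figure in the text.
GIB = 1024 ** 3


def transfer_time_ns(line_bytes, peak_gb_per_s):
    """Time to move one cache line at peak bandwidth, in nanoseconds."""
    return line_bytes / (peak_gb_per_s * GIB) * 1e9


def effective_bandwidth(line_bytes, peak_gb_per_s, latency_ns):
    """Bandwidth (GB/s) seen when each line pays the full latency."""
    total_ns = latency_ns + transfer_time_ns(line_bytes, peak_gb_per_s)
    return line_bytes / (total_ns * 1e-9) / GIB


if __name__ == "__main__":
    t = transfer_time_ns(128, 6.4)
    bw = effective_bandwidth(128, 6.4, 125.0)
    print(f"transfer time: {t:.1f} ns")
    print(f"effective bandwidth: {bw:.2f} GB/s")
```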

Now, given that the pentium can have 4 outstanding cache load misses,
you can in effect overlap 4 operations and quarter the latency, which
brings the time per line down to around 50ns (125/4 + 18.6) and gives
around 2.4GB/s for the same 128 byte cache line.  Now take into
account all the other factors: some memory is already in fast caches;
you can't quite quarter the latency; the 4 operations don't quite
happen simultaneously because of the 18ns it takes to move the data;
etc.
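
The overlap argument can be sketched the same way.  This is a
deliberately crude model, not a description of how any particular cpu
behaves: it assumes the latency is divided evenly among the
outstanding misses while the transfer time is not, which is one
reading of the paragraph above.  The function name and the list of
miss counts are made up for illustration.  With 4 misses it lands at
about 2.4GB/s.

```python
# Crude model: n outstanding misses hide latency, but each line still
# needs its own transfer slot on the memory bus.
# Assumption: "GB" means 2**30 bytes, as in the 18.6ns calculation.
GIB = 1024 ** 3


def overlapped_bandwidth(line_bytes, peak_gb_per_s, latency_ns, outstanding):
    """Approximate bandwidth (GB/s) with `outstanding` misses in flight."""
    if outstanding < 1:
        raise ValueError("need at least one outstanding miss")
    transfer_ns = line_bytes / (peak_gb_per_s * GIB) * 1e9
    # Latency is amortised over the overlapping misses; transfer is not.
    per_line_ns = latency_ns / outstanding + transfer_ns
    return line_bytes / (per_line_ns * 1e-9) / GIB


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        bw = overlapped_bandwidth(128, 6.4, 125.0, n)
        print(f"{n} outstanding: {bw:.2f} GB/s")
```

In this model the result only approaches the 6.4GB/s peak as the
number of outstanding misses grows, which is the coupling between
latency and bandwidth being argued here.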

The end result is that latency has a MASSIVE impact on real bandwidth.


> This doesn't answer even remotely accurate things.
> A) my test is doing no WRITES, just READS.
> B) snooping might be for free.
> C) all other cores are just idle when such a latency test for just 1 core
> happens and the rest of the system is idle.
> D) in all cases a dual core processor has a SLOWER latency and it doesn't
> make sense.
> E) you don't seem to grasp the difference between LATENCY and BANDWIDTH.
> For example your BANDWIDTH to Mars might be GREAT, but your LATENCY to Mars
> is real ugly, as it takes 200 years for them to return.
> You keep mixing latency and bandwidth. That's ugly, to say polite.
> I'm speaking of LATENCY here, not bandwidth.
> The total BANDWIDTH that my program takes at a dual core is to be correct:
> 8 bytes * 1 billion (1/ns) / 147 (ns) = 54MB/s
> In fact with some luck your gigabit ethernet card might be able to handle
> 54MB/s.
> Vincent

Dr Stuart Midgley
sdm900 at gmail.com
