[Beowulf] 1.2 us IB latency?

Håkon Bugge Hakon.Bugge at scali.com
Wed Apr 25 05:31:05 EDT 2007

At 17:55 24.04.2007, Ashley Pittman wrote:
>That would explain why qlogic use PIO for up to 64k messages and we
>switch to DMA at only a few hundred.  For small messages you could best
>describe what we use as a hybrid of the above descriptions, we write the
>network packet across the PCI bus and don't DMA at all.

I assume QsNet has to do something with the 
packet after it has been written to the HCA. 
Since the outbound PCI address space is only 
32 bits (who needs more than 4 GB of CSR space, other 
than cluster people attempting to map all the 
accumulated memory of the nodes in the cluster 
into a single address space?), I assume QsNet 
uses part of the packet as 64-bit address 
information and starts a DMA from the HCA local 
buffer to the remote destination.

>The downside to PIO of course is you need a CPU to drive it so besides
>the fact it's slow you can't do anything asynchronously.

This is a classic tradeoff. Most applications 
_create_ the message before it is sent (contrary 
to many p2p benchmarks). Hence, it resides in the 
L1 or L2 cache of the CPU in the (MOESI) Modified 
state. It is then very efficient to use the CPU to 
read its local cache and write the message using 
the WC buffer. With DMA, by contrast, the HCA has 
to issue a DMA read to memory, the CPU cache(s) are 
snooped, and data is transferred to the memory _and_ 
to the HCA. The cache line ends up in the Shared 
state, and a bus transaction is required in order to 
make it Modified again (when the buffer is written 
the next time).
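
The state transitions described above can be sketched as a toy model. This is only an illustration of the argument in this message, not a model of any real chipset: the class and method names are made up, only the Modified and Shared states are represented, and the counter tracks coherence-related bus transactions only (the PIO stores themselves still cross the bus and are not counted).

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"
    SHARED = "S"


class CacheLine:
    """Toy model of one cache line holding the message buffer (hypothetical)."""

    def __init__(self):
        # The application has just created the message, so the line is dirty.
        self.state = State.MODIFIED
        self.bus_transactions = 0  # coherence-related transactions only

    def pio_send(self):
        # The CPU reads its own cache and stores to the HCA via the WC buffer.
        # No snoop is involved, so the line stays Modified.
        return self.state

    def dma_send(self):
        # The HCA issues a DMA read; the cache is snooped and the dirty data
        # goes to memory and to the HCA, leaving the line Shared.
        if self.state is State.MODIFIED:
            self.bus_transactions += 1
            self.state = State.SHARED
        return self.state

    def cpu_write(self):
        # Rewriting a Shared line needs a bus transaction to make it Modified.
        if self.state is State.SHARED:
            self.bus_transactions += 1
            self.state = State.MODIFIED
        return self.state


def coherence_cost(send_method, iterations):
    """Count coherence transactions for repeated send-then-rewrite cycles."""
    line = CacheLine()
    for _ in range(iterations):
        getattr(line, send_method)()
        line.cpu_write()
    return line.bus_transactions
```

Under these assumptions, `coherence_cost("pio_send", n)` stays at zero, while `coherence_cost("dma_send", n)` grows by two per iteration: one for the snooped DMA read and one to regain the Modified state on the next write.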

>That's an interesting theory, but I suspect your numbers are a little
>out.  My own measurements put a PIO word write in the region of .15 uSec
>depending on chipset.  Of course if you are right then the remaining PIO
>write is happening in 1 uSec which leaves only .2uSec for the network
>which seems a little fast to me.

Just to make sure we are comparing the same thing: 
is the .15 usec the time from the CPU issuing the 
store instruction until the side effect is 
visible in the HCA? In other words, assuming a CSR 
word read takes 0.5 usec, a loop writing and 
reading the same CSR takes 0.65 usec, right? If 
that is the case, CSR accesses have improved 
radically over the last few years.
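
The arithmetic behind that question can be written out as a small sketch. It assumes the read does not complete until the preceding write has reached the device, so one write-then-read iteration costs the sum of the two. The 0.5 usec read time is the figure assumed in this message and the 0.15 usec write time is the figure quoted above; neither is a measurement made here, and the function names are illustrative.

```python
CSR_READ_USEC = 0.5    # assumed in this message, not measured
PIO_WRITE_USEC = 0.15  # figure quoted from the earlier post


def loop_time(write_usec, read_usec):
    """Expected time of one write-then-read iteration on the same CSR."""
    return write_usec + read_usec


def write_latency_from_loop(loop_usec, read_usec):
    """Write latency implied by a measured loop time and a known read time."""
    return loop_usec - read_usec


if __name__ == "__main__":
    total = loop_time(PIO_WRITE_USEC, CSR_READ_USEC)
    print(f"expected loop time: {total:.2f} usec")
    implied = write_latency_from_loop(total, CSR_READ_USEC)
    print(f"implied write latency: {implied:.2f} usec")
```

If a measured loop time came out noticeably below the read time plus 0.15 usec, the 0.15 usec figure would more likely be the time for the CPU to retire the store than the time until the HCA sees it.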

