[Beowulf] 1.2 us IB latency?
Hakon.Bugge at scali.com
Wed Apr 25 05:31:05 EDT 2007
At 17:55 24.04.2007, Ashley Pittman wrote:
>That would explain why qlogic use PIO for up to 64k messages and we
>switch to DMA at only a few hundred. For small messages you could best
>describe what we use as a hybrid of the above descriptions: we write the
>network packet across the PCI bus and don't DMA at all.
I assume QsNet has to do something with the
packet after it has been written to the HCA.
Since the outbound PCI address space is only
32 bits (who needs more than 4 GB of CSR, other
than cluster people attempting to map all the
accumulated memory of the nodes in the cluster
into a single address space?), I assume QsNet
uses part of the packet as 64-bit address
information and starts a DMA from the HCA local
buffer to the remote destination.
>The downside to PIO of course is you need a CPU to drive it so besides
>the fact it's slow you can't do anything asynchronously.
This is a classic tradeoff. Most applications
_create_ the message before it is sent (contrary
to many p2p benchmarks). Hence, it resides in the
L1 or L2 cache of the CPU in the (MOESI) Modified
state. It is then very efficient to use the CPU to
read its local cache and write the message using
the WC (write-combining) buffer. In contrast, with
DMA the HCA has to issue a DMA read to memory, the
CPU caches are snooped, and the data is transferred
to memory _and_ to the HCA. The cache line ends up
in the Shared state, and a bus transaction is
required to make it Modified again (the next time
the buffer is written).
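
To make the cache-state argument concrete, here is a toy sketch in
Python. It is not a model of any real chipset: the names (Line,
cpu_write, send_pio, send_dma, run) are hypothetical, and the
Modified-to-Shared transition on a DMA read simply follows the
description above (a real MOESI implementation may behave differently).
It only counts how often the CPU has to go to the bus to regain write
permission for a buffer that is rewritten before every send.

```python
from dataclasses import dataclass


@dataclass
class Line:
    """One cache line holding the message buffer (toy model)."""
    state: str = "I"             # one of M, O, E, S, I
    ownership_requests: int = 0  # bus transactions needed to gain write permission


def cpu_write(line: Line) -> None:
    # Writing a line that is not already held exclusively (M or E)
    # costs one bus transaction to obtain ownership.
    if line.state not in ("M", "E"):
        line.ownership_requests += 1
    line.state = "M"


def send_pio(line: Line) -> None:
    # PIO: the CPU reads its own cache and stores to the device through
    # the WC buffer. The cache line state is left untouched.
    pass


def send_dma(line: Line) -> None:
    # DMA, as described in the text: the device's read snoops the cache,
    # the data goes to memory and to the device, the line ends up Shared.
    if line.state == "M":
        line.state = "S"


def run(send, iterations: int) -> int:
    """Rewrite and send the same buffer `iterations` times."""
    line = Line()
    for _ in range(iterations):
        cpu_write(line)  # the application creates the message
        send(line)       # the message is handed to the network
    return line.ownership_requests


if __name__ == "__main__":
    print("PIO ownership requests:", run(send_pio, 4))
    print("DMA ownership requests:", run(send_dma, 4))
```

In this toy model the PIO path pays for ownership once (the initial
fill), while the DMA path pays on every rewrite of the buffer.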
>That's an interesting theory, but I suspect your numbers are a little
>out. My own measurements put a PIO word write in the region of .15 uSec
>depending on chipset. Of course if you are right then the remaining PIO
>write is happening in 1 uSec which leaves only .2uSec for the network
>which seems a little fast to me.
Just to make sure we are comparing the same thing: is the
.15usec the time from the CPU issuing the
store instruction until the side effect is
visible in the HCA? In other words, assuming a CSR
word read takes 0.5usec, a loop writing and
reading the same CSR takes 0.65usec, right? If
that is the case, CSR accesses have improved radically in the last few years.
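
The arithmetic behind that question, as a small Python sketch. The
function names are hypothetical, and the additive model (the read
cannot complete before the preceding write to the same CSR is visible,
so the two costs add) is an assumption; real hardware may overlap or
buffer these accesses differently. Times are in nanoseconds to keep
the arithmetic exact.

```python
def loop_time_ns(write_ns: int, read_ns: int) -> int:
    """Time for one write-then-read iteration on the same CSR,
    assuming the write and read latencies simply add."""
    return write_ns + read_ns


def implied_write_ns(measured_loop_ns: int, read_ns: int) -> int:
    """Write latency implied by a measured loop time and a known
    read latency, under the same additive assumption."""
    return measured_loop_ns - read_ns


if __name__ == "__main__":
    # Figures from the discussion: 0.15 usec write, 0.5 usec read.
    print(loop_time_ns(150, 500), "ns per write+read iteration")
    print(implied_write_ns(650, 500), "ns implied write latency")
```

If a measured write+read loop comes out well above 650 ns with a
500 ns read, the .15usec figure is probably measuring something other
than store-to-visibility latency.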