[Beowulf] Adding Latency to a Cluster Environment

Fri Feb 13 17:46:38 EST 2004

On Fri, 13 Feb 2004, Don Holmgren wrote:

> I tried something like this a couple of years ago to investigate the
> bandwidth and latency sensitivity of an application which was using
> MPICH over Myrinet.

... which is pretty different from the setup of the original poster :-)
But I'd like to see it discussed in general, so let's go on.

> a modified version of the "mcp" for Myrinet which added ...

Is this publicly available? I'd like to give it a try.

> The modifications on top of the OSU modifications to gm

Well, that's a very important point: using GM, which doesn't try to do
as many things as TCP does. I haven't used GM directly nor looked at its
code, but I think that it doesn't introduce delays the way TCP does in
some cases. Moreover, based on the description in the GM docs, GM does
not need to be optimized by the compiler, as it's not in the fast path.
Under such conditions, the results can be relied upon.

> Adding a 50 microsecond busy loop, say, to the beginning of an MPI_*Send
> call is going to perturb your results because the processor won't be
> doing useful work during that time.

In the case of TCP, the processor doesn't appear to be doing anything
useful for "long" stretches anyway, as it spends that time in kernel
space. So a 50 microsecond busy loop might not make a difference. And
given the somewhat non-deterministic behaviour of TCP in this respect,
whether the delay is added before or after the PMPI_* call might itself
make a difference.
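
To make the before/after question concrete, here is a minimal sketch in
C. It deliberately does not use MPI: send_fn, dummy_send, delay_us and
delayed_send are made-up names, and dummy_send only stands in for a real
call such as PMPI_Send. The delay here sleeps rather than busy-loops,
which is just one possible choice.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical signature standing in for a send routine such as PMPI_Send. */
typedef int (*send_fn)(const void *buf, size_t len);

enum delay_pos { DELAY_BEFORE, DELAY_AFTER };

/* Stub transport for illustration only: pretends to send len bytes. */
int dummy_send(const void *buf, size_t len)
{
    (void)buf;
    return (int)len;
}

/* Sleep (not busy-wait) for the given number of microseconds. */
void delay_us(long us)
{
    struct timespec ts;
    assert(us >= 0);
    ts.tv_sec = us / 1000000L;
    ts.tv_nsec = (us % 1000000L) * 1000L;
    nanosleep(&ts, NULL);
}

/* Call the real send with an artificial delay placed before or after it. */
int delayed_send(send_fn real_send, const void *buf, size_t len,
                 long us, enum delay_pos pos)
{
    int rc;
    if (pos == DELAY_BEFORE)
        delay_us(us);
    rc = real_send(buf, len);
    if (pos == DELAY_AFTER)
        delay_us(us);
    return rc;
}
```

Running the same application once with DELAY_BEFORE and once with
DELAY_AFTER would show whether the placement matters for a given
transport.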

The delays don't have to be busy-loops. Busy-loops are probably precise,
but might have some side-effects; for example, repeatedly reading a
hardware counter (all the more so if it sits on a PCI device, which is
"far" from the CPU and might be even "farther" with PCI bridge(s) in
between) will generate lots of "in*" operations during which the CPU is
stalled waiting for data. Especially with today's CPU speeds, I/O
operations are expensive in terms of CPU cycles...
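
As a rough illustration of the two kinds of delay, here is a sketch in C
using the POSIX clock_gettime() and nanosleep() calls. The function names
(now_ns, spin_delay_ns, sleep_delay_ns) are mine, and the clock read here
is a system clock, not the PCI device counter mentioned above, so it only
shows the structure of the trade-off, not its actual cost.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Monotonic time in nanoseconds. */
uint64_t now_ns(void)
{
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(rc == 0);
    (void)rc;
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/* Busy-wait: precise, but the CPU does no useful work while spinning,
 * and every iteration re-reads the clock. */
void spin_delay_ns(uint64_t ns)
{
    uint64_t start = now_ns();
    while (now_ns() - start < ns)
        ;
}

/* Sleeping delay: frees the CPU, but may overshoot by an amount that
 * depends on the scheduler and timer resolution. */
void sleep_delay_ns(uint64_t ns)
{
    struct timespec ts;
    ts.tv_sec = (time_t)(ns / 1000000000ULL);
    ts.tv_nsec = (long)(ns % 1000000000ULL);
    nanosleep(&ts, NULL);
}
```

For delays of a few tens of microseconds the sleeping variant may
overshoot by far more than the requested value, so it is worth measuring
what a given kernel actually delivers before relying on it.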

> You are likely interested in delays of 10's of microseconds.

Well, it depends :-) The latencies for today's HW+SW seem to span about
2 orders of magnitude, so giving absolute figures doesn't make much sense
IMHO. Apart from this, I would rather suggest increasing the delay value
exponentially.
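
A sketch of what such an exponential sweep could look like, in C. The
function name delay_sweep and the example values (start at 1 microsecond,
double each step, stop at 1000) are only illustrative assumptions; pick
whatever range matches the hardware being compared.

```c
#include <assert.h>
#include <stddef.h>

/* Fill out[] with start_us, start_us*factor, start_us*factor^2, ...
 * Stops at max_us or after n entries; returns the number written. */
size_t delay_sweep(long start_us, long factor, long max_us,
                   long *out, size_t n)
{
    size_t i = 0;
    long d = start_us;

    assert(start_us > 0 && factor > 1);
    while (i < n && d <= max_us) {
        out[i++] = d;
        if (d > max_us / factor)
            break;              /* next step would exceed max_us */
        d *= factor;
    }
    return i;
}
```

With start 1, factor 2 and maximum 1000 this gives the ten values
1, 2, 4, ..., 512, which covers the range with far fewer runs than a
linear sweep would need.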

>  - some ethernet interfaces have very sophisticated processors aboard.
>    IIRC there were gigE NICs (Broadcom, maybe???) which had a MIPS cpu.

Well, if the company releases enough documentation about the chip, then
yes ;-) 3Com has the 990 line which is still FastE but has a programmable
processor, so it's not only GigE.

>    Obviously this has the huge disadvantage of being specific to
>    particular network chips.

But there aren't that many programmable network chips these days. Those
Ethernet chips might even be in wider use than Myrinet[1] and more people
might benefit from such development. If I had to choose the GigE network
cards for the next cluster purchase and knew that one offers such
capabilities without significant flaws compared to the others, I'd
certainly buy it.

Another hardware approach: the modern 3Com cards driven by 3c59x, Cyclone
and Tornado, have the means to delay a packet in their (hardware) Tx
queue. There is however a catch: there is no guarantee that the packet
will be sent at the exact time specified, it can be delayed; the only
guarantee is that the packet is not sent before that time. However, I
somehow think that this is true for most other approaches, so it's not as
bad as it sounds :-)
The operation is pretty simple, as the packet is "stamped" with the time
when it should be transmitted, expressed in internal clock ticks.
Only one "in" operation to read the current clock is needed per packet, so
this is certainly much less intrusive than the busy-loop.
[ I'm too busy (but not busy-looping :-)) to try this at the moment. If
somebody feels the urge, I can provide some guidance :-) ]
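
To show the idea only, here is a small model of the "stamp with the
earliest transmit time" scheme in C. Everything in it is hypothetical:
the tx_desc layout, the 32-bit free-running counter and the
ticks_per_us parameter are assumptions for illustration and are not taken
from the 3c59x driver or the card documentation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical descriptor; real hardware layouts differ. */
struct tx_desc {
    uint32_t not_before;   /* earliest tick at which the packet may go out */
};

/* Stamp the packet: one clock read (now_ticks) plus the wanted delay.
 * The sum wraps modulo 2^32, like a free-running counter would. */
void tx_stamp(struct tx_desc *d, uint32_t now_ticks,
              uint32_t delay_us, uint32_t ticks_per_us)
{
    assert(d != NULL);
    d->not_before = now_ticks + delay_us * ticks_per_us;
}

/* Wrap-safe check: true once the counter has reached the stamp.
 * Models "never before that time, possibly later". */
int tx_may_send(const struct tx_desc *d, uint32_t now_ticks)
{
    return (uint32_t)(now_ticks - d->not_before) < 0x80000000u;
}
```

The wrap-safe comparison only works while the requested delay stays
below half the counter range, which is not a problem for delays in the
microsecond range.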

However, anything that still uses TCP (as both your Broadcom approach and 
my 3Com one do) will likely generate unreliable results...

> it would be great to simulate performance of different network
> architectures on specific applications.

Certainly! Especially as this would provide a means to justify spending
money on a fast interconnect ;-)

[1] I don't want this to look like I'm saying "compared with Myrinet as
it's the most widely used high-performance interconnect" and neglect
Infiniband, SCI, etc; I have no idea about "market share" of the different
interconnects. I compare with Myrinet because the original message talked
about it and because I'm ignorant WRT programmable processors on other
interconnect NICs.

-- 
Bogdan Costescu

IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: Bogdan.Costescu at IWR.Uni-Heidelberg.De
