[Beowulf] Adding Latency to a Cluster Environment

Fri Feb 13 19:49:05 EST 2004

On Fri, 13 Feb 2004, Bogdan Costescu wrote:

> On Fri, 13 Feb 2004, Don Holmgren wrote:
>
> > I tried something like this a couple of years ago to investigate the
> > bandwidth and latency sensitivity of an application which was using
> > MPICH over Myrinet.
>
> ... which is pretty different from the setup of the original poster :-)
> But I'd like to see it discussed in general, so let's go on.
>
> > a modified version of the "mcp" for Myrinet which added ...
>
> Is this publicly available ? I'd like to give it a try.

I'm afraid not, sorry, since the modified code base from OSU isn't
publicly available.  IIRC it was part of a project for a master's
degree; if it's OK with them, it's OK with me (we can take this
offline).  The modified MCP had a bug I never fixed which required me to
reset the card and reload the driver when some counter overflowed, at
something like a gigabyte of messages.  Long enough to get very good
statistics, though.

>
> > The modifications on top of the OSU modifications to gm
>
> Well, that's a very important point: using GM, which doesn't try to make
> too many things like TCP does. I haven't used GM directly nor looked at
> its code, but I think that it doesn't introduce delays, like TCP does in
> some cases. Moreover, based on the description in the GM docs, GM is not
> needed to be optimized by the compiler as it's not in the fast path.
> Obviously, in such conditions, the results can be relied upon.

I miswrote a bit; to be precise, this was a modification to the MCP,
which is the NIC firmware, rather than to GM, which is the user space
code that interacts with the NIC hardware.  The modification caused the
NIC itself to introduce interpacket delays of a configurable value.  To
the application (well, to MPICH and to GM) it simply looked like the
external Myrinet network had a different bandwidth and/or latency.
There were tiny code changes to MPICH and to GM to allow modification of
the interpacket delay values in the MCP; otherwise I would have had to
recompile or patch the firmware image and reload that image for each new
value.

You are absolutely correct that GM, like all good OS-bypass software,
doesn't introduce the delays that you'd encounter with communications
protocols like TCP that have to pass through the kernel/user space
boundary.  Much more deterministic.

>
> > Adding a 50 microsecond busy loop, say, to the beginning of an MPI_*Send
> > call is going to perturb your results because the processor won't be
> > doing useful work during that time.
>
> In the case of TCP, the processor doesn't appear to be doing anything
> useful for "long" times, as it spends time in kernel space. So, a 50
> microseconds busy loop might not make a difference. And given the somehow
> non-deterministic behaviour of TCP in this respect, it might be that
> adding the delay before the PMPI_* or after PMPI_* calls might make a
> difference.

TCP processing is likely a significant component of the natural latency,
and, as you point out, during that time the CPU is busy in kernel space
and isn't doing useful work.  But the goal here is to add additional
artificial latency in a manner that mimics a slower physical network,
i.e., so that during this artificial delay the application can still be
crunching numbers.  In user space I don't see how to accomplish this
goal (adding latency, yes; adding latency during which the CPU can do
calculations, no).

If delay code is added correctly in kernel space, say in the TCP/IP
stack (sounds like a nasty bit of careful work!), then during that 50
usec period the CPU could certainly be doing useful work in user space.
Small delays, relative to the timer tick, are very difficult to do
accurately in non-realtime kernels unless you have a handy source of
interrupts, like the local APIC.

Assuming that LAM MPI isn't multithreaded (I have no idea), then adding
a delay in the user space code in the MPI call, whether it's a sleep or
a busy loop, guarantees that no useful application work can be done
during the delay.
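
To make the distinction concrete, here is a toy timing model in Python.
It is only a sketch, not a measurement: the function names are made up
for illustration, and it assumes an application that can perfectly
overlap communication with computation when the delay sits in the
network rather than in the MPI call.

```python
def step_time_blocking(compute_usec, delay_usec):
    """Delay added inside the MPI call in user space.

    The CPU sits in the sleep or busy loop, so the costs simply add.
    """
    return compute_usec + delay_usec


def step_time_overlapped(compute_usec, delay_usec):
    """Delay added in the NIC or the network itself.

    Computation proceeds during the delay, so (assuming perfect
    overlap) the step takes the longer of the two.
    """
    return max(compute_usec, delay_usec)


if __name__ == "__main__":
    compute, delay = 200.0, 50.0  # microseconds, illustrative values only
    print("blocking:  ", step_time_blocking(compute, delay), "usec")
    print("overlapped:", step_time_overlapped(compute, delay), "usec")
```

With these made-up numbers the user space delay costs the full 50 usec
per step, while the network-side delay is hidden entirely behind the
computation.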

I confess to being totally ignorant of the PMPI_* calls (time for
homework!) and defer humbly to the MPI masters from ANL.  I'm definitely
curious as to how these added latencies are implemented.

>
> The delays don't have to be busy-loops. Busy-loops are probably precise,
> but might have some side-effects; for example, reading some hardware
> counter (even more as it is on a PCI device, which is "far" from the CPU
> and might be even "farther" if it has any PCI bridge(s) in between)
> repeatedly will generate lots of "in*" operations during which the CPU is
> stalled waiting for data. Especially with today's CPU speeds, I/O
> operations are expensive in terms of CPU cycles...

Agreed, though I'd hope on x86 that reading the time stamp counter is
very quick and with minimal impact - it's got to be more like a
register-to-register move than an I/O access.  Hopefully on a modern
superscalar processor this doesn't interfere with the other execution
units.

[As I write this, I just ran a program that reads the time stamp counter
back to back to different registers, multiple times.  The difference in
values was a consistent 84 counts or 56 nsec on this 1.5 GHz Xeon - so,
definitely minimal impact.]
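
The conversion from counts to time is just a division by the clock
rate.  A small Python sketch of that arithmetic, assuming the time
stamp counter ticks at the nominal CPU clock rate (the function name is
hypothetical):

```python
def tsc_counts_to_nsec(counts, cpu_hz):
    """Convert a time stamp counter difference to nanoseconds.

    Assumes the counter increments once per CPU clock cycle.
    """
    return counts / cpu_hz * 1e9


if __name__ == "__main__":
    # 84 counts on a 1.5 GHz processor, as measured above.
    print(f"{tsc_counts_to_nsec(84, 1.5e9):.0f} nsec")
```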

Without busy loops, achieving accurate delays on the order of tens to
hundreds of microseconds with little jitter is a real trick in user
space (and in kernel space as well!).  nanosleep() won't work: it
delivers a delay on the order of 10 or 20 msec (i.e., until the next
timer tick) instead of the requested 50 usec.
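
A rough way to see the difference on a given machine is to request the
same short delay both ways and compare what is actually delivered.  The
Python sketch below is an illustration only: the helper names are
hypothetical, the interpreter adds its own overhead and jitter, and the
size of the sleep overshoot depends entirely on the kernel's timer
resolution.

```python
import time


def busy_wait_usec(delay_usec):
    """Spin on the monotonic clock until delay_usec has elapsed.

    Returns the delay actually achieved, in microseconds.
    """
    start = time.perf_counter_ns()
    target = start + int(delay_usec * 1000)
    while time.perf_counter_ns() < target:
        pass
    return (time.perf_counter_ns() - start) / 1000.0


def sleep_usec(delay_usec):
    """Ask the OS to sleep for delay_usec.

    Returns the delay actually achieved, in microseconds.
    """
    start = time.perf_counter_ns()
    time.sleep(delay_usec / 1e6)
    return (time.perf_counter_ns() - start) / 1000.0


if __name__ == "__main__":
    for name, fn in (("busy-wait", busy_wait_usec), ("sleep", sleep_usec)):
        samples = [fn(50) for _ in range(100)]
        print(f"{name}: min {min(samples):.1f} usec, "
              f"max {max(samples):.1f} usec")
```

The spread between min and max for each method gives a crude estimate
of the jitter.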

>
> > You are likely interested in delays of 10's of microseconds.
>
> Well, it depends :-) The latencies for today's HW+SW seem to be in a range
> of about 2 orders of magnitude, so giving absolute figures doesn't make
> much sense IMHO. Apart from this I would rather suggest an exponential
> increase in the delay value.

True.  I was really thinking of my specific problem, not his!

The relevant latency range for deciding between InfiniBand and switched
Ethernet is ~ 6 usec to ~ 100+ usec, and the bandwidth range is ~ 100
MB/sec (gigE) to ~ 700 MB/sec (I.B.).  It would be really useful to be
able to inject latencies in that range with a precision of 5
usec or so, and to dial the bandwidth with a precision of ~ 50 MB/sec.
Of course, if latency really matters, one would drop TCP/IP and use an
OS-bypass, like GAMMA or MVIA.
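
For a feel of where latency or bandwidth dominates, a first-order
model is time = latency + size / bandwidth.  The Python sketch below
plugs in the endpoints of the ranges quoted above; the model ignores
everything else (protocol overhead, contention, pipelining), and the
function name is hypothetical.

```python
def transfer_time_usec(msg_bytes, latency_usec, bandwidth_mb_per_sec):
    """Latency plus size/bandwidth model of a single message transfer.

    1 MB/sec is 1 byte/usec (taking MB as 10**6 bytes), so dividing
    bytes by MB/sec yields microseconds directly.
    """
    return latency_usec + msg_bytes / bandwidth_mb_per_sec


if __name__ == "__main__":
    # (latency in usec, bandwidth in MB/sec); real fabrics will differ.
    fabrics = {"gigE": (100.0, 100.0), "InfiniBand": (6.0, 700.0)}
    for size in (64, 4096, 1048576):
        for name, (lat, bw) in fabrics.items():
            t = transfer_time_usec(size, lat, bw)
            print(f"{size:>8} bytes over {name}: {t:.1f} usec")
```

In this model small messages are dominated by the latency term and
large ones by the bandwidth term, which is why it helps to be able to
dial the two independently.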

> ...
>
> > it would be great to simulate performance of different network
> > architectures on specific applications.
>
> Certainly ! Especially as this would provide means to justify spending
> money on fast interconnect ;-)

What we need is some kind corporate soul to put up a large public
cluster with the lowest latency, highest bandwidth network fabric
available.  Then, we can add our adjustable firmware and degrade that
fabric to mimic less expensive networks, and figure out what we should
really buy.  Works for me!

Don Holmgren
Fermilab