[Beowulf] Adding Latency to a Cluster Environment

Fri Feb 13 13:15:28 EST 2004

On Fri, 13 Feb 2004 joshh at cs.earlham.edu wrote:

> Here is an irregular question. I am profiling a software package that runs
> over LAM-MPI on 16 node clusters [Details Below]. I would like to measure
> the effect of increased latency on the run time of the program.
>
> It would be nice if I could quantify the added latency in the process to
> create some statistics. If possible, I do not want to alter the code line
> of the program, or buy new hardware. I am looking for a software
> solution/idea.
>
> Bazaar Cluster:
> 16 Node Red Hat Linux machines running 500MHz PIII, 512MB RAM
> 1 100Mbps NIC card in each machine
> 2 100Mbps Full-Duplex switches
>
> Cairo Cluster:
> 16 Node YellowDog Linux machines running 1GHz PPC G4, 1GB RAM
> 2 1Gbps NIC cards in each machine (only one in use)
> 2 1Gbps Full-Duplex switches
>
> For more details on these clusters follow the link below:
> http://cluster.earlham.edu/html/
>
> Thank you,
>
> Josh Hursey
> Earlham College Cluster Computing Group
>

Not an irregular question at all.

I tried something like this a couple of years ago to investigate the
bandwidth and latency sensitivity of an application which was using
MPICH over Myrinet.  One of D.K.Panda's students from Ohio State
University had a modified version of the "mcp" for Myrinet which added
quality of service features, tunable per connection.  The "mcp" is the
code which runs on the LANai microprocessor on the Myrinet interface
card.  Further modifications, on top of the OSU changes to gm, used a
hardware timer on the interface card to add a fixed delay per packet for
bandwidth tuning, and a fixed delay per message (i.e., a delay added to
only the first packet of a new connection) for latency tuning.  Via
netpipe, I verified that I could independently tune the bandwidth and
latency.  Lots of fun to play with - for example, by plotting the
difference in message times for two different latency settings, the
eager-rendezvous threshold was easily identified.  All in all a very
useful experiment which told us a lot about our application.

Clearly, you want to delay the sending of a message, or the processing
of a received communication, without otherwise interfering with what the
system is doing.  Adding a 50 microsecond busy loop, say, to the
beginning of an MPI_*Send call is going to perturb your results because
the processor won't be doing useful work during that time.  That's
obviously not the same as running on a network with a switch that adds
the same 50 microseconds latency; in that case, the processor could be
doing useful work during the delay, happily overlapping computations
with communications.
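
To put rough numbers on that distinction, here is a toy model.  It
assumes one message per compute step and, in the network case, perfect
overlap of communication with computation.  The function names and the
model itself are illustrative assumptions, not measurements.

```c
/* Idealized per-step time when the added latency is a CPU busy loop:
   the processor cannot compute while it spins, so the two costs add. */
double step_time_busy_loop(double compute_usec, double latency_usec)
{
    return compute_usec + latency_usec;
}

/* Idealized per-step time when the latency comes from the network and
   the application overlaps it completely with computation: the step
   takes as long as the slower of the two. */
double step_time_overlapped(double compute_usec, double latency_usec)
{
    return compute_usec > latency_usec ? compute_usec : latency_usec;
}
```

With 200 microseconds of computation per step and 50 microseconds of
added latency, the busy-loop model gives 250 microseconds per step while
the overlapped model still gives 200.  A real application will fall
somewhere between the two.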

Nevertheless, adding busy loops might still give you useful results.
You might want to look into using a LD_PRELOAD library to intercept MPI
calls of interest, assuming you're using a shared library for MPI.  In
your version, do the busy loop, then fall into the normal call.  A quick
google search on "LD_PRELOAD" or "library interposers" will return a lot
of examples, such as:
    http://uberhip.com/godber/interception/index.html
    http://developers.sun.com/solaris/articles/lib_interposers.html
The advantage of this approach is that no modifications to your source
code or compiled binaries are necessary.  You'll have to think carefully
about whether the added latency is slowing your application simply
because the processor is not doing work during the busy loop.  If I were
you, I'd modify your source code and time your synchronizations (e.g.,
MPI_Wait).  If your code is cpu-bound, these will return right away, and
adding latency via a busy loop is going to give you the wrong answer.
If your code is communications bound, these will have a variable delay
depending upon the latency and bandwidth of the network.
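
A minimal sketch of the interposer pattern follows.  Because the exact
MPI_Send prototype depends on the installed mpi.h, this sketch
interposes on the socket-level send() instead so that it stays
self-contained; the same pattern applies to an MPI call by changing the
symbol name and prototype.  Whether your MPI library actually goes
through send() (rather than write() or writev()) is an assumption you
would need to check, and the environment variable name
ADDED_LATENCY_USEC is made up for this example.

```c
#define _GNU_SOURCE             /* for RTLD_NEXT in <dlfcn.h> */
#include <dlfcn.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <time.h>

#ifndef RTLD_NEXT               /* glibc's value, in case _GNU_SOURCE */
#define RTLD_NEXT ((void *) -1L)    /* was defined too late to count */
#endif

/* Spin for roughly `usec` microseconds using the monotonic clock. */
static void busy_wait_usec(long usec)
{
    struct timespec start, now;

    clock_gettime(CLOCK_MONOTONIC, &start);
    do {
        clock_gettime(CLOCK_MONOTONIC, &now);
    } while ((now.tv_sec - start.tv_sec) * 1000000L +
             (now.tv_nsec - start.tv_nsec) / 1000L < usec);
}

/* Interposed send(): busy-wait, then fall into the real libc send().
   Initialization is not thread-safe; this is only a sketch. */
ssize_t send(int fd, const void *buf, size_t len, int flags)
{
    static ssize_t (*real_send)(int, const void *, size_t, int);
    static long delay_usec = -1;

    if (real_send == NULL)
        real_send = (ssize_t (*)(int, const void *, size_t, int))
                        dlsym(RTLD_NEXT, "send");
    if (real_send == NULL) {    /* no underlying send() was found */
        errno = ENOSYS;
        return -1;
    }
    if (delay_usec < 0) {       /* read the delay once, on first use */
        const char *s = getenv("ADDED_LATENCY_USEC");  /* hypothetical */
        delay_usec = (s != NULL) ? atol(s) : 0;
        if (delay_usec < 0)
            delay_usec = 0;
    }
    busy_wait_usec(delay_usec);
    return real_send(fd, buf, len, flags);
}
```

Built as a shared object (for example with gcc -shared -fPIC, plus -ldl
on older glibc) and named in LD_PRELOAD, this delays every send() by the
requested number of microseconds.  How LD_PRELOAD and the delay variable
reach the remote processes depends on your MPI launcher.  The caveat
above still applies: the processor spins during the delay.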

You are likely interested in delays of tens of microseconds.  The most
accurate busy loops for this sort of thing use the processor hardware
timers, which tick every clock on x86.  On a G5 PPC running OS-X, the
hardware timer ticks every 60 cpu cycles.  I'm not sure what a PPC does
under Linux.  On x86, you can read the cycle timer via:
   #include <asm/msr.h>
   unsigned long long timerVal;
   rdtscll(timerVal);

A crude delay loop example:

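   unsigned long long timeStart, timeEnd;

   rdtscll(timeStart);
   do {
      rdtscll(timeEnd);
   } while ((timeEnd - timeStart) < latency * ticksPerUsec);

where latency is in microseconds, and ticksPerUsec is your calibration
(timer ticks per microsecond, e.g. about 500 on a 500 MHz PIII).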

There have been other recent postings to this mailing list about using
inline assembler macros to read the time stamp counter.
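
As a rough user-space sketch of both the inline-assembler read and the
calibration step: the code below reads the time stamp counter on x86,
estimates the number of ticks per microsecond against CLOCK_MONOTONIC,
and uses that factor in a delay loop.  The function names are my own,
the 10 ms calibration window is an arbitrary choice, and the non-x86
branch (which simply counts nanoseconds) is an assumption rather than a
real cycle counter.

```c
#define _POSIX_C_SOURCE 200809L     /* for clock_gettime() */
#include <stdint.h>
#include <time.h>

/* Read a free-running tick counter.  On x86 this is the time stamp
   counter; elsewhere (e.g. PPC) fall back to nanoseconds from
   CLOCK_MONOTONIC. */
uint64_t read_ticks(void)
{
#if defined(__i386__) || defined(__x86_64__)
    uint32_t lo, hi;

    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
#else
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
#endif
}

/* Wall-clock time in microseconds, used only as the reference. */
double monotonic_usec(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e6 + ts.tv_nsec / 1e3;
}

/* Estimate ticks per microsecond by counting ticks over about 10 ms. */
double calibrate_ticks_per_usec(void)
{
    double t0 = monotonic_usec();
    uint64_t c0 = read_ticks();
    double t1;
    uint64_t c1;

    do {
        t1 = monotonic_usec();
    } while (t1 - t0 < 10000.0);
    c1 = read_ticks();
    return (double)(c1 - c0) / (t1 - t0);
}

/* Busy-wait for `usec` microseconds, given the calibration factor. */
void delay_usec(double usec, double ticks_per_usec)
{
    uint64_t start = read_ticks();
    uint64_t wait = (uint64_t)(usec * ticks_per_usec);

    while (read_ticks() - start < wait)
        ;
}
```

Calibrate once at startup and pass the result to every delay call.  On
machines where the counter rate changes with the clock frequency, or
differs between processors, the calibration will drift, so treat the
resulting delays as approximate.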

Injecting small latencies without busy loops and without disturbing your
source code is going to be very difficult (though I'd love to be
contradicted on that statement!).  A couple of far-fetched ideas in
kernel land:

 - some ethernet interfaces have very sophisticated processors aboard.
   IIRC there were gigE NICs (Broadcom, maybe???) which had a MIPS cpu.
   Perhaps the firmware can be modified similarly to the modified mcp
   for gm discussed above.  Obviously this has the huge disadvantage of
   being specific to particular network chips.

 - the local APIC on x86 processors has a programmable interval timer
   with better-than-microsecond granularity which can be used to
   generate an interrupt.  Perhaps in the communications stack, or in
   the network device driver, a wait_queue could be used to postpone
   processing until after an interrupt from this timer.  I would worry
   about considerable jitter, though.
   For a sample driver using this feature, see
        http://www.oberle.org/apic_timer-timers.html
   The various realtime Linux folks talk about this as well:
        http://www.linuxdevices.com/articles/AT6105045931.html
   Unfortunately, IIRC this timer is now used (since 2.4 kernel) for
   interprocessor interrupts on SMP systems.  On uniprocessor systems it
   may still be available.

I hope there's something useful for you in this response.  I'm hoping
even more that there are other responses to your question - I would love
a facility which would allow me to "turn the dial" on latency and/or
bandwidth.  There's a substantial cost difference between a gigE cluster
and a Myrinet/Infiniband/Quadrics/SCI cluster, and it would be great to
simulate performance of different network architectures on specific
applications.

Don Holmgren
Fermilab