[Beowulf] How Would You Test Infiniband in New Cluster?

Bill Broadley bill at cse.ucdavis.edu
Tue Nov 17 17:46:43 EST 2009

Jon Forrest wrote:
> Bill Broadley wrote:
>> My first suggested sanity test would be to test latency and bandwidth to
>> ensure you are getting IB numbers.  So 80-100MB/sec and 30-60us for a
>> small packet would imply GigE.  6-8 times the bandwidth certainly would
>> imply SDR or better.  Latency varies quite a bit among implementations;
>> I'd try to get within 30-40% of advertised latency numbers.
> For those of us who aren't familiar with IB utilities,
> could you give some examples of the commands you'd use
> to do this?
> Thanks,
> Jon

Here are two that I use: relay, a simple message-relay test, and
mpi_nxnlatbw, an all-pairs latency/bandwidth test.
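
The relay.c source itself isn't included in this message.  Purely as a rough,
hypothetical sketch of the same kind of test (not the actual relay.c, which
may count hops and compute bandwidth somewhat differently), a minimal relay
looks something like this:

/* relay_sketch.c - hypothetical stand-in for relay.c: pass a message of a
 * given size (argv[1], in bytes) around the ranks and report microseconds
 * per hop and KB/sec. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    int rank, nprocs, hops = 16384;
    int bytes = (argc > 1) ? atoi(argv[1]) : 1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    char *buf = malloc(bytes);
    memset(buf, 0, bytes);

    /* show which hosts the ranks landed on (they should all differ) */
    char name[MPI_MAX_PROCESSOR_NAME], *names = NULL;
    int namelen;
    MPI_Get_processor_name(name, &namelen);
    if (rank == 0)
        names = malloc((size_t)nprocs * MPI_MAX_PROCESSOR_NAME);
    MPI_Gather(name, MPI_MAX_PROCESSOR_NAME, MPI_CHAR,
               names, MPI_MAX_PROCESSOR_NAME, MPI_CHAR, 0, MPI_COMM_WORLD);
    if (rank == 0) {
        for (int i = 0; i < nprocs; i++)
            printf("%s ", &names[i * MPI_MAX_PROCESSOR_NAME]);
        printf("\n");
    }

    int next = (rank + 1) % nprocs;
    int prev = (rank + nprocs - 1) % nprocs;
    int rounds = hops / nprocs;               /* one round = nprocs hops */

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < rounds; i++) {
        if (rank == 0) {                      /* rank 0 starts each round */
            MPI_Send(buf, bytes, MPI_CHAR, next, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, bytes, MPI_CHAR, prev, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else {                              /* everyone else relays it  */
            MPI_Recv(buf, bytes, MPI_CHAR, prev, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, bytes, MPI_CHAR, next, 0, MPI_COMM_WORLD);
        }
    }
    double secs = MPI_Wtime() - t0;

    if (rank == 0) {
        int done = rounds * nprocs;           /* hops actually performed */
        double us_per_hop = secs / done * 1e6;
        double kb_per_sec = bytes / (secs / done) / 1024.0;
        printf("size=%6d, %6d hops, %2d nodes in %6.2f sec "
               "(%6.2f us/hop) %8.0f KB/sec\n",
               bytes, done, nprocs, secs, us_per_hop, kb_per_sec);
        free(names);
    }
    free(buf);
    MPI_Finalize();
    return 0;
}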

So to compile, assuming a sane environment:
mpicc -O3 relay.c -o relay

The command to run an MPI program varies by MPI implementation and batch
queue environment (especially with tight integration), but it should be
something close to:
mpirun -np <number of nodes> -machinefile <list of nodes> ./relay 1
mpirun -np <number of nodes> -machinefile <list of nodes> ./relay 1024
mpirun -np <number of nodes> -machinefile <list of nodes> ./relay 8192

You should see something like:
c0-8 c0-22
size=     1,  16384 hops,  2 nodes in   0.75 sec ( 45.97 us/hop)     85 KB/sec
c0-8 c0-22
size=  1024,  16384 hops,  2 nodes in   2.00 sec (121.94 us/hop)  32803 KB/sec
c0-8 c0-22
size=  8192,  16384 hops,  2 nodes in   6.21 sec (379.05 us/hop)  84421 KB/sec

So basically: on a tiny packet, about 45us of latency (normal for GigE), and
on a large packet, 84MB/sec or so (also normal for GigE, whose wire rate tops
out at 125MB/sec).

I'd start with 2 nodes, then if you are happy try it with all nodes.

Now for InfiniBand you should see something like:

c0-5 c0-4
size=     1,  16384 hops,  2 nodes in   0.03 sec (  1.72 us/hop)   2274 KB/sec
c0-5 c0-4
size=  1024,  16384 hops,  2 nodes in   0.16 sec (  9.92 us/hop) 403324 KB/sec
c0-5 c0-4
size=  8192,  16384 hops,  2 nodes in   0.50 sec ( 30.34 us/hop) 1054606 KB/sec

Note the latency is some 25 times lower and the bandwidth some 10+ times
higher.  Note also that the hostnames are different; don't run multiple
copies on the same node unless you intend to.  Running 4 copies on a 4-CPU
node doesn't test the network at all, only the shared-memory communication
inside that node.
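
With most mpirun/machinefile setups, listing each hostname once and using
-np equal to the number of nodes gives you one rank per node (exact placement
behavior varies by MPI implementation).  A machinefile is just one hostname
per line, e.g. for the nodes above:
c0-4
c0-5
c0-8
c0-22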

So once you get what you expect, I'd suggest something a bit more
comprehensive.  Something like:
mpirun -np <number of nodes> -machinefile <list of nodes> ./mpi_nxnlatbw
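
mpi_nxnlatbw isn't shipped with every MPI stack.  If you can't find it, a
rough, hypothetical stand-in that makes the same all-pairs measurement looks
something like this (the real tool is presumably more careful about warm-up
and timing):

/* nxnlatbw_sketch.c - hypothetical stand-in for mpi_nxnlatbw: for every
 * ordered pair of ranks, ping-pong a 1-byte message to estimate latency and
 * a 1MB message to estimate bandwidth, printing one line per pair. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define LATREPS 1000          /* small-message ping-pongs per pair   */
#define BWREPS   100          /* large-message ping-pongs per pair   */
#define BIG    (1<<20)        /* 1 MB payload for the bandwidth test */

static void pingpong(int me, int src, int dst, int bytes, int reps,
                     char *buf, double *secs)
{
    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++) {
        if (me == src) {
            MPI_Send(buf, bytes, MPI_CHAR, dst, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, bytes, MPI_CHAR, dst, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (me == dst) {
            MPI_Recv(buf, bytes, MPI_CHAR, src, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, bytes, MPI_CHAR, src, 0, MPI_COMM_WORLD);
        }
    }
    *secs = MPI_Wtime() - t0;
}

int main(int argc, char **argv)
{
    int rank, np;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &np);

    char *buf = malloc(BIG);

    for (int src = 0; src < np; src++) {
        for (int dst = 0; dst < np; dst++) {
            if (src == dst) continue;
            double tlat, tbw;
            /* serialize the pairs so only one pair talks at a time */
            MPI_Barrier(MPI_COMM_WORLD);
            pingpong(rank, src, dst, 1,   LATREPS, buf, &tlat);
            pingpong(rank, src, dst, BIG, BWREPS,  buf, &tbw);
            if (rank == src)
                /* x2 on bandwidth: each round trip moves the payload once
                 * in each direction */
                printf("[%d<->%d]\t\t%.2fus\t\t%f (MillionBytes/sec)\n",
                       src, dst,
                       tlat / LATREPS / 2 * 1e6,   /* one-way latency */
                       (double)BIG * BWREPS * 2 / tbw / 1e6);
        }
    }

    free(buf);
    MPI_Finalize();
    return 0;
}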

I'd expect some difference in latency and bandwidth between node pairs, but
nothing dramatic.  Something like:
[0<->1]		1.85us		1398.825264 (MillionBytes/sec)
[0<->2]		1.75us		1300.812337 (MillionBytes/sec)
[0<->3]		1.76us		1396.205242 (MillionBytes/sec)
[0<->4]		1.68us		1398.647324 (MillionBytes/sec)
[1<->0]		1.82us		1375.550155 (MillionBytes/sec)
[1<->2]		1.69us		1397.936020 (MillionBytes/sec)

Once those numbers are consistent and where you expect them (both latency
and bandwidth), I'd follow up with a production code that produces a known
answer and is likely to exercise a much wider range of MPI functionality.
