InfiniBand for the Masses

Published on Sunday, 27 January 2008 10:10
Written by Jeff Layton

From the "I'll take eight department"

The Linux cluster world is moving toward InfiniBand for many reasons: bandwidth, latency, message rate, N/2, price/performance, and other factors that affect performance and price. But the focus is usually on larger systems, often greater than 64 nodes and up to several thousand nodes. At the same time, the reasons for moving to InfiniBand are still valid for smaller clusters, particularly performance, but the economics are not. Basically, InfiniBand has just been too expensive for smaller systems and usually does not make sense from a price/performance perspective. But that has just changed...

The Rise of InfiniBand

InfiniBand has made a remarkable rise in performance since its inception. Just a few years ago, Single Data Rate (SDR) InfiniBand was the standard. SDR has a 10Gbit/s signaling rate and about an 8Gbit/s data rate (recall that GigE is 1Gbit/s for both signaling and data). Coupled with this high bandwidth were much lower latency and CPU overhead. The performance of InfiniBand drew cluster people to it like moths to a flame.


The very first InfiniBand products were pricey. Shortly thereafter, the price started to drop to the point where you could get SDR InfiniBand for less than $1,500 a node (including the HCA, or IB card, cable, and switch port costs). Sometimes you could get it for less than $1,000 a node. In short order it became a frequently selected interconnect for clusters.

Not long after SDR was out, Double Data Rate (DDR) InfiniBand came out. DDR InfiniBand has a 20Gbit/s signaling rate and about a 16Gbit/s data rate: basically twice the bandwidth of SDR. In conjunction with the bandwidth increase came a drop in latency. Initially DDR was priced just a bit above SDR, but soon DDR was priced the same as SDR. So now you could get twice the bandwidth and lower latency compared to SDR for less than $1,200 a node. Consequently, SDR all but disappeared.

Recently Mellanox announced that Quad Data Rate (QDR) InfiniBand silicon for HCAs is available and that silicon for QDR switches will be available soon. QDR InfiniBand has a signaling rate of 40Gbit/s and a data rate of about 32Gbit/s. You should start to see QDR HCAs and switches for purchase in late Q3 or Q4 of this year.
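The pattern in these rates comes from InfiniBand's 8b/10b line encoding: every 10 bits on the wire carry 8 bits of data, so the data rate is 80% of the signaling rate. A quick sketch of the arithmetic:

```python
# InfiniBand SDR/DDR/QDR links use 8b/10b line encoding:
# every 10 signal bits on the wire carry 8 bits of data.
ENCODING_EFFICIENCY = 8 / 10

signaling_gbps = {"SDR": 10, "DDR": 20, "QDR": 40}

for rate, gbps in signaling_gbps.items():
    data = gbps * ENCODING_EFFICIENCY
    print(f"{rate}: {gbps} Gbit/s signaling -> {data:.0f} Gbit/s data")
# SDR: 10 Gbit/s signaling -> 8 Gbit/s data
# DDR: 20 Gbit/s signaling -> 16 Gbit/s data
# QDR: 40 Gbit/s signaling -> 32 Gbit/s data
```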

Overall, InfiniBand provides performance benefits to many applications, including those that use MPI as well as those used in traditional data centers, such as Oracle, VMware, financial applications, etc. The ever-growing compute demands of these applications drive the growth of InfiniBand.

A Quick Network Comparison

As you are probably aware, the network can have a big impact on code performance, particularly if you are running parallel codes that use MPI (or, God help you, PVM). Table One below lists some commonly and publicly reported interconnect characteristics for GigE, low-latency GigE, 10GigE, SDR InfiniBand (two flavors), and DDR InfiniBand.

Table One - Common Network Characteristics

Network                                        Latency (microseconds)   Bandwidth (MBps)   N/2 (bytes)
GigE                                           ~29-120                  ~125               ~8,000
Low Latency GigE: GAMMA                        ~9.5 (MPI)               ~125               ~7,600
10GigE: Chelsio (copper)                       9.6                      ~862               ~100,000+
InfiniBand: Mellanox SDR InfiniHost (PCI-X)    4.1                      760                512
InfiniBand: Mellanox InfiniHost III Ex SDR     2.6                      938                480
InfiniBand: Mellanox InfiniHost III Ex DDR     2.25                     1502               480
InfiniBand: Mellanox ConnectX DDR PCIe Gen2    1                        1880               256


I won't cover the details of these characteristics in this article (here's an article that might help despite its age). You can see from the table that SDR InfiniBand is still much better than GigE, low-latency GigE, or even 10GigE.

The Rise of SDR InfiniBand

InfiniBand has been expensive for smaller clusters because the HCAs are fairly expensive and, most of the time, the smallest switch you could buy had 24 ports. So if you only had, let's say, 4 to 8 nodes, the per-node cost of the switch was just too high (a factor of 3-6 compared to 24 nodes). But on the application performance side, smaller clusters could use InfiniBand, particularly as the number of cores per node increases. Smaller clusters don't necessarily need the huge bandwidth that DDR InfiniBand offers and many times don't need its extremely low latency either; the bandwidth and latency of SDR InfiniBand will greatly help their applications. But InfiniBand has always been considered too expensive. Until now.
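The switch-port economics are easy to see with a little arithmetic, using the $2,400 24-port switch price that appears later in Table Two:

```python
# Amortizing a 24-port SDR switch (about $2,400) over clusters of
# different sizes. Unused ports still have to be paid for.
switch_price = 2400
baseline = switch_price / 24   # per-node switch cost with all 24 ports used

for nodes in (4, 8, 24):
    per_node = switch_price / nodes
    print(f"{nodes:2d} nodes: ${per_node:.2f}/node "
          f"({per_node / baseline:.1f}x the 24-node cost)")
#  4 nodes: $600.00/node (6.0x the 24-node cost)
#  8 nodes: $300.00/node (3.0x the 24-node cost)
# 24 nodes: $100.00/node (1.0x the 24-node cost)
```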

Mellanox and Colfax International have teamed up to bring back SDR, but at a price point that makes it extremely attractive for small clusters. At this point you're saying "Shut up and tell me the prices!" As I tell my children, "Just relax," but I usually end up with something thrown in my general direction. Since I don't want anyone to throw things at me, let's go over the prices. BTW, the website with all of the prices is here.

Note: The HCA listed in Table Two does not seem to have recent public benchmark data available. Therefore, actual performance may differ from that shown in Table One.

Table Two - SDR InfiniBand Pricing from Colfax

Product                                           Price ($)*  Colfax Product Description/Part Number
SDR HCA NIC PCI-Express x4                        $125        MHES14-XTC InfiniHost III Lx, Single Port 4X InfiniBand / PCI-Express x4, Low Profile HCA Card, Memory Free, RoHS (R5) Compliant, (Tiger)
8-port 4X SDR switch                              $750        Flextronics ODM model F-X430066, 8 Port 4X SDR InfiniBand switch
24-port 4X 1U SDR InfiniBand switch (unmanaged)   $2,400      Flextronics ODM model F-X430060, 24-port 4X SDR InfiniBand switch w/ Media Adapter Support, one power supply
0.5 meter SDR cable                               $35         MCC4L30-00A 4x microGiGaCN latch, 30 AWG, 0.5 meter
1 meter SDR cable                                 $39         MCC4L30-001 4x microGiGaCN latch, 30 AWG, 1 meter
2 meter SDR cable                                 $46         MCC4L30-002 4x microGiGaCN latch, 30 AWG, 2 meters
3 meter SDR cable                                 $52         MCC4L30-003 4x microGiGaCN latch, 30 AWG, 3 meters
4 meter SDR cable                                 $58         MCC4L28-004 4x microGiGaCN latch, 28 AWG, 4 meters
5 meter SDR cable                                 $65         MCC4L28-005 4x microGiGaCN latch, 28 AWG, 5 meters
6 meter SDR cable                                 $86         MCC4L24-006 4x microGiGaCN latch, 24 AWG, 6 meters
7 meter SDR cable                                 $93         MCC4L24-007 4x microGiGaCN latch, 24 AWG, 7 meters
8 meter SDR cable                                 $99         MCC4L24-008 4x microGiGaCN latch, 24 AWG, 8 meters

* Prices do not include shipping.


So let's do a little math. Table Three below has the InfiniBand prices for 8 nodes.

Table Three - 8 nodes with SDR InfiniBand

Item                                Price ($) without shipping
HCAs (8 of them)                    $1,000
8-port SDR switch                   $750
0.5 meter CX-4 cables (8 of them)   $280
Total                               $2,030
Price Per Node                      $253.75


So if you buy SDR InfiniBand for 8 nodes, you will pay less than $255 a node (without shipping, of course)!

Let's do the same thing for a 24-node SDR cluster.

Table Four - 24 nodes with SDR InfiniBand

Item                                 Price ($) without shipping
HCAs (24 of them)                    $3,000
24-port SDR switch                   $2,400
0.5 meter CX-4 cables (24 of them)   $840
Total                                $6,240
Price Per Node                       $260.00


The per-node price is slightly higher than for 8 nodes because of the switch cost ($100 per port for the 24-port switch versus $93.75 per port for the 8-port switch). I'm not sure about you, but to me this is a fantastic price, and it's moving down in the general direction of GigE! (Well, not quite, but it's getting there!)
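The totals in Tables Three and Four reduce to a one-line formula: nodes x (HCA + cable) + switch. A small sketch using the Colfax prices from Table Two:

```python
# Per-node SDR InfiniBand cost, using the Table Two prices.
HCA = 125                      # MHES14-XTC InfiniHost III Lx HCA
CABLE = 35                     # 0.5 meter microGiGaCN (CX-4) cable
SWITCH = {8: 750, 24: 2400}    # 8-port and 24-port SDR switches

for nodes, switch in SWITCH.items():
    total = nodes * (HCA + CABLE) + switch
    print(f"{nodes:2d} nodes: total ${total:,}, per node ${total / nodes:.2f}")
#  8 nodes: total $2,030, per node $253.75
# 24 nodes: total $6,240, per node $260.00
```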

How do I Get Me Some of That?

Ordering SDR InfiniBand at these prices is easy. Colfax International has set up a webpage that allows you to order online! Just go to the page and place your order. If you need large quantities or special arrangements, please send an email to sales (you know what to put here) colfaxdirect.com.

Please Note: Neither ClusterMonkey nor any of its authors has a financial interest in Colfax International. We just like cheap hardware.

To Infinity and Beyond!

I hate to end with a Buzz Lightyear quote, but it seems somewhat appropriate. For smaller clusters you usually had to rely on GigE as the interconnect. Now you can add SDR InfiniBand to these systems without it being too expensive. This means we get a big boost in performance on these smaller systems (including the one in my basement! Woo! Hoo!). Now we can truly begin to think outside the box, or rather, outside the server room.


We can start thinking about adding a parallel file system to these smaller clusters, or even about exporting NFS over native IB protocols from the master node. Also don't forget that you can run TCP over IB (see the OpenFabrics Alliance for the complete software stack). Even with SDR InfiniBand you will get much faster TCP performance over IB than over GigE. So you can start thinking about applications or places where GigE limits performance (anyone want to play multi-player games using IPoIB?).
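One reason TCP over IB (IPoIB) is so convenient: once the IPoIB interface (typically ib0) has an IP address, ordinary sockets code runs over it unchanged. A minimal sketch; the loopback address is used here so it runs anywhere, and you would point HOST at whatever address you assigned to ib0 to run it over InfiniBand:

```python
# TCP is unchanged over IPoIB: once ib0 has an IP address, plain
# sockets code works over InfiniBand. HOST is loopback here so the
# sketch runs anywhere; replace it with your ib0 address to use IB.
import socket
import threading

HOST, PORT = "127.0.0.1", 0    # port 0 lets the OS pick a free port

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind((HOST, PORT))
server.listen(1)
port = server.getsockname()[1]

def serve():
    conn, _ = server.accept()
    conn.sendall(b"hello over IPoIB (or loopback)")
    conn.close()

t = threading.Thread(target=serve)
t.start()

client = socket.create_connection((HOST, port))
msg = client.recv(1024)
print(msg.decode())
client.close()
t.join()
server.close()
```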


Jeff Layton is having way too much fun writing this article, proving that it's hard to keep a good geek down. When he's not creating havoc in his household, he can be found hanging out at the Fry's coffee shop (never during working hours) and admiring the shiny new CPUs that come in, and cringing when someone buys Microsoft Vista.
