Cluster Interconnects: The Whole Shebang

Infinipath

Infinipath is a new interconnect technology developed by Pathscale. It is a variation on IB taking advantage of the Hypertransport bus in Opteron systems and, soon, even PCI-Express systems. In addition to implementing standard InfiniBand protocols, this technology enables accelerated software stacks that have very low latency and high messaging rates.

Pathscale was founded in 2001 by a team of experienced engineers and scientists from Sun, SGI, and HP. In April 2006, PathScale was acquired by QLogic, a well-known maker of storage equipment such as fibre channel HBAs and switches. The focus of Pathscale is on software and hardware solutions that enable Linux clusters to achieve new levels of performance and efficiency. PathScale has 2 flavors of InfiniPath adaptors, one which plugs into the AMD Opteron HyperTransport bus, and one (shipping in late May) which plugs into PCI-Express bus. The NICs are then connected using standard IB switches and cables.

This combination is unique for several reasons. First, it uses standard IB switches, which reduces cost compared to developing a new switch. Second, the NIC that plugs directly into the HyperTransport bus can achieve latency 2 to 3 times lower than other IB NICs. Third, as the speed of the Opteron processors increases, the latency of the Infinipath NIC will continue to decrease.

InfiniPath adaptors work with all InfiniBand switches currently available. InfiniBand compliance is achieved by using the OpenFabrics (formerly OpenIB) software stack. The InfiniPath device driver is part of the kernel.org Linux kernel (as of 2.6.17). In addition to OpenFabrics software, PathScale distributes an accelerated MPI stack based on MPICH-1. This MPI stack, combined with the HyperTransport HTX adaptor, achieves 1.29 microsecond latency and more than 8 million messages/second on an 8-core system.

As mentioned, Pathscale ships an accelerated MPI for Infinipath. It is based on MPICH 1.2.6 with performance enhancements for Infinipath. Other open-source MPI implementations can support Infinipath using the OpenFabrics software stack. Pathscale also has an IP over Infinipath capability for its accelerated stack. Thus any IP-based protocol package should work (PVM or TCP/IP-based MPIs), albeit with reduced performance.

As you will see in the performance table at the end of this article, Infinipath is the fastest performing interconnect available for clusters today. It has the lowest latency and the smallest N/2 (defined later). Consequently, its performance on many HPC codes is very good.

Myrinet

Myrinet was one of the first high-speed networks developed with clustering in mind. Myricom is a privately held company started in 1994 by Charles L. Seitz. Some of the initial technology was developed under two ARPA sponsored research projects. Myrinet became an ANSI standard in 1998 and is now one of the most popular high-speed cluster interconnects, being the dominant non-Ethernet interconnect on the 26th Top500 list. Myrinet is used by one of the fastest clusters in the world, MareNostrum, in Spain.

Myricom has two product lines. Their existing product line, called Myrinet-2000 or Myrinet-2G, has been in successful production use since 2000. Their new product line, called Myri-10G, started shipping at the end of 2005. Myri-10G is based on 10-Gigabit Ethernet at the PHY layer (layer 1), and is dual-protocol Ethernet and Myrinet at the Data Link layer (layer 2).

Myrinet - 2G

Myricom currently has three NICs in the 2G line. All of them use fiber optic connections. The 'D' card is the lowest price card with a 225 MHz RISC processor and a single fiber port. The newer 'F' card uses a 333 MHz RISC processor with a single fiber port. The 'E' card has two NIC ports and uses a 333 MHz RISC processor. All three NICs are 64-bit PCI-X based cards that are "short" cards (not full length), and are low-profile.

Myrinet 2G is a switched network. The network is based on a Clos design that uses small switch elements to build larger switches. Myrinet currently uses 16-port switches as the basic building block. Myrinet 2G network switches come in several sizes. They have a 2U switch chassis that can accommodate 8-port and 16-port models. For medium size networks, they have a 3U switch chassis for 32-ports, a 5U switch chassis for up to 64-ports, and a 9U switch for up to 128 ports. For larger networks they have a single 14U switch with up to 256 host ports. For even larger networks, you can connect the switches using 'spine' cards creating a Clos network giving full bisection bandwidth to each port. This gives a great deal of flexibility when designing a network topology.
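To make the Clos idea concrete, here is a minimal sketch of textbook folded-Clos (fat-tree) sizing. It is not a description of how Myricom actually builds its chassis; the function names and the assumption that every switch element splits its ports evenly between "down" and "up" links are ours, chosen only for illustration.

```python
def max_hosts(radix: int, levels: int) -> int:
    """Host ports of a full-bisection folded Clos built from `radix`-port elements.

    Assumes each non-top element uses half its ports downward and half upward.
    """
    if radix < 2 or radix % 2 != 0 or levels < 1:
        raise ValueError("radix must be even and >= 2, levels must be >= 1")
    return 2 * (radix // 2) ** levels


def switch_elements(radix: int, levels: int) -> int:
    """Number of `radix`-port elements needed for that fully populated network."""
    if radix < 2 or radix % 2 != 0 or levels < 1:
        raise ValueError("radix must be even and >= 2, levels must be >= 1")
    return (2 * levels - 1) * (radix // 2) ** (levels - 1)


if __name__ == "__main__":
    for levels in (1, 2, 3):
        print(levels, max_hosts(16, levels), switch_elements(16, levels))
```

Under these assumptions, one level of 16-port elements gives 16 host ports and two levels give 128, which lines up with the 16-port building block and the 128-port chassis mentioned above.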

Since Myrinet is focused on clusters, it has taken advantage of the fact that it is not tied to compatibility with general purpose networks such as Ethernet, TCP, and IP. Consequently, it has made changes to improve networking performance, specifically using a different protocol than TCP. Current protocols for the Myrinet 2G line include GM and MX. They use simpler packets than TCP, resulting in better usage of the packets (less overhead). These packets can be of any length so they can contain packets of other types, such as IP packets. In addition, the MX protocol has been redesigned so that it has about half the zero-packet latency of the GM protocol. To further improve performance, Myrinet also uses an OS-bypass like interface to help latency and reduce CPU overhead.

The data packets on Myrinet are source routed which means that each host must know the route to all of the other hosts through the switch fabric. The result is that the NICs do most of the work and the switches can be very simple. Since the switches are not doing much of the work, each NIC must know the full network topology to route the data and the topology must be fairly static.
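Source routing can be illustrated with a toy model. The sketch below is not the Myrinet wire format; the fabric layout, host names, and the `deliver` function are all hypothetical. It only shows the division of labor described above: the sender supplies the full list of output ports, and each switch simply consumes the next entry.

```python
from collections import deque

# Hypothetical fabric: switch name -> {output port: next hop (switch or host)}.
FABRIC = {
    "sw0": {0: "hostA", 1: "hostB", 2: "sw1"},
    "sw1": {0: "hostC", 1: "hostD", 2: "sw0"},
}

# Route table held by hostA (attached to sw0): one output port per switch hop.
ROUTES_FROM_HOST_A = {"hostB": [1], "hostC": [2, 0], "hostD": [2, 1]}


def deliver(first_switch, route, payload):
    """Forward a packet whose header is a list of output ports.

    Each switch pops the leading port from the header and forwards the packet;
    the switch itself holds no routing table.
    """
    header = deque(route)
    node = first_switch
    while node in FABRIC:
        if not header:
            raise ValueError("route ran out before reaching a host")
        port = header.popleft()
        node = FABRIC[node][port]
    if header:
        raise ValueError("route has unused hops after reaching a host")
    return node, payload


if __name__ == "__main__":
    for dest, route in ROUTES_FROM_HOST_A.items():
        print(dest, deliver("sw0", route, b"hello"))
```

Note that if the fabric changes, every host's route table has to be rebuilt, which is why the topology needs to be fairly static.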

There are several MPIs available for Myrinet. Myricom uses an open-source MPI called MPICH-GM for the GM protocol and MPICH-MX for the MX protocol. Both are based on MPICH. Open MPI supports GM and will support MX. LAM-MPI supports GM. Also, since Myrinet can run TCP over GM, you can use any MPI that uses TCP with Myrinet. There is a small performance penalty for running in this manner.

There are several commercial MPI implementations that support Myrinet with GM. Verari System Software (previously MPI Software Technologies) has MPI/Pro and Scali has Scali MPI Connect.

Myri-10G

The Myri-10G product is a new product line from Myricom. It takes the bandwidth up to 10 Gbps, the same as 4X Infiniband and 10 GigE. However, the 10G NIC is different than most NICs because it can take on different personalities depending upon which switch it is plugged into. If you plug it into a Myricom switch, it will use MX as the protocol. If you plug it into a 10 GigE switch, it will talk TCP.

The 10G NIC is PCI-Express only (sorry PCI-X world). Right now it uses 8 lanes (x8 in PCI-Express talk). As with the 2G NICs, it includes a processor and firmware. It does the network protocol processing and also uses OS-bypass to improve performance.

The NICs are 10GBase-CX4, 10GBase-R, or XAUI over ribbon fiber. The copper 10GBase-CX4 cables can go up to 15m in length and the 10GBase-R serial fiber cables are good for 10 GigE.

Since the Myrinet design philosophy puts the intelligence in the NIC, they can use the same basic switch concepts for 10G. So they will have Clos networks with full bisection bandwidth to each port. Initially there will be 16-port and 128-port switches, with a 256-port switch to follow in the first half of 2006.

Quadrics

Quadrics is another company that has focused on very high performance interconnects for clusters. The Quadrics R&D team grew out of the Meiko supercomputer company. The company was incorporated in 1996 in Europe as part of the Finmeccanica Group. It is headquartered in Britain with regional offices in Italy and the US. Its networking technology, QsNet and the newest version, QsNetII, is focused on providing the lowest latency possible with the largest bandwidth.

In many ways, Quadrics interconnect technology is similar to other high-speed networking technologies but Quadrics is one of the best performing cluster interconnects on the market. The intelligent Quadrics NICs use an OS-bypass capability to help reduce latency. The NICs also use RDMA concepts to reduce copying. This helps both latency and bandwidth.

Quadrics currently is selling two NICs and several switches. The QM400 NIC is a 64-bit 66 MHz PCI NIC and the QM500 is either a 64-bit or 128-bit 133 MHz PCI-X NIC for QsNetII. The QM500 is a new card that has faster on-board processors and uses less power than the QM400. However the QM400 card is still one of the fastest cards for a simple PCI bus.

Quadrics is a switched network like most of the networks in this survey. The basic topology of Quadrics (QsNetII) is to connect the NICs with copper cables (up to 13m in length), or fiber cables for longer distances, to 8-port switch chips that are arranged in a fat-tree configuration. They have a 128-port switch and a 16-port switch for QsNet. These switches can be linked together to form large switching networks in a fat-tree topology so each port has full bisection bandwidth. For QsNetII they have a range of stand-alone switches (E-series: 8-port, 32-port, 128-port) and an R-series network of federated switches scalable up to 4,096 ports.

Quadrics provides an MPI implementation for QsNet and QsNetII. The MPI is based on MPICH from Argonne Labs and is built on top of the low latency libraries written by Quadrics. It also has shmem capability for SMP nodes. Verari System Software (previously MPI Software Technologies) has a commercial MPI, ChaMPIon/Pro, that supports Quadrics. HP-MPI, Intel-MPI, and Scali-MPI are also used on Quadrics-connected systems.

Like other vendors, Quadrics provides TCP/IP capability on their networks allowing TCP based MPIs and PVM to work as well. The TCP/IP layer has also been used for Lustre and the PVFS2 file systems. Finally, Quadrics offers the Quadrics Resource Management System (RMS) which provides a high level of parallel supercomputing facilities for UNIX and Linux working on top of QsNet.

Dolphin

Dolphin Interconnect Solutions develops, markets, and supports SCI (Scalable Coherent Interface) networking for clusters. Dolphin began as a group of visionary engineers at the Norwegian computer maker Norsk Data. Together with engineers from a number of US computer makers, the SCI (Scalable Coherent Interface) standard was developed and made an official IEEE/ANSI standard in the spring of 1992.

Dolphin Interconnect Solutions was formed in the autumn of 1991 to focus on SCI markets and products. During the period from 1992 to 1996, the core SCI technology was developed and clusters using SCI began to appear. Dolphin Interconnect Solutions is based in Oslo, Norway with sales representation around the world. Product and technology development is conducted by the technical staff in Oslo, in close collaboration with a network of partners, consultants, universities and research institutes.

Dolphin ships single, dual, and triple-port SCI cards in PCI form factors and a dual-port PCI-Express card. SCI is a switchless network. The NICs are connected to one another in some network topology. For example, you can make a simple ring topology, a 2D torus topology, or even a 3D torus topology (i.e., by using multiple ports on the NICs, you create one-dimensional, two-dimensional, and three-dimensional topologies). With a switched network, you have to worry about how many open ports you have; if you need more, you have to buy another switch or a bigger switch. Expanding an existing SCI system usually just involves plugging the new NIC into the network in the appropriate fashion. However, a downside is that if a node in the network goes down, traffic on the links to that node must be rerouted, which affects messaging in the rest of the system. On the other hand, SCI avoids the single point of failure of typical switched networks.

The chips on the NICs handle all of the routing so that a packet, if it is not intended for that node, has no impact on the CPU at all. Consequently, the NICs are very efficient at routing packets to the neighboring nodes.

The flexibility of network design is a big plus for SCI. To design an SCI network, you must choose the layout carefully to avoid saturation. Typically, 8-10 nodes will saturate a simple SCI ring, 64-100 nodes will saturate a 2D SCI torus, and 640-1000 nodes will saturate a 3D SCI torus.
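The ring, 2D torus, and 3D torus layouts can be pictured as a grid of node coordinates in which each NIC port sits on one ring per dimension. The following sketch is only a way to visualize that structure; the function names and the coordinate scheme are our own and say nothing about how Dolphin's hardware or drivers address nodes.

```python
from math import prod


def total_nodes(dims):
    """Number of nodes in a torus with the given ring size per dimension."""
    return prod(dims)


def rings_for(coord, dims):
    """Return, for each dimension, the nodes that share a ring with `coord`.

    `coord` and `dims` are tuples of equal length; a 1-tuple is a plain ring,
    a 2-tuple a 2D torus, a 3-tuple a 3D torus.
    """
    if len(coord) != len(dims):
        raise ValueError("coord and dims must have the same length")
    rings = []
    for axis, size in enumerate(dims):
        ring = [coord[:axis] + (i,) + coord[axis + 1:] for i in range(size)]
        rings.append(ring)
    return rings


if __name__ == "__main__":
    dims = (4, 4)  # hypothetical 2D torus of 16 nodes
    print(total_nodes(dims))
    for axis, ring in enumerate(rings_for((1, 2), dims)):
        print("ring along axis", axis, ring)
```

Keeping each ring short while adding dimensions is what lets the node count grow without pushing too many nodes onto any single ring.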

There are several MPI implementations available for SCI. There is an open-source version based on MPI 1.2 from MPICH called SCI-MPICH. There is also an open-source MPI based on MPICH2 called SCI-MPICH2.

Dolphin offers an open source package called Dolphin SuperSockets. It provides the necessary transparent glue to boost any available MPI or PVM implementations. It has been tested with MPICH2, LAM-MPI, PVM, and even Lustre, GFS, and iSCSI.

Dolphin also sells a package called Dolphin Sockets. It provides the necessary transparent glue underneath MPI and PVM implementations. It has been tested with MPICH2, LAM-MPI, PVM, and even Lustre, GFS, and iSCSI. Scali also supplies a commercial MPI implementation that works with SCI.

There Is More, But

There are other network technologies we did not mention, but we decided to cover the "mainstream" products and technology that are used by the HPC cluster market. Please contact us if you think there is something we missed.

Summary

It's always nice to have a table of data to compare various things. I've created a table to compare the various interconnects I've talked about. In particular, the first table lists latency in microseconds, bandwidth in megabytes per second (MBps), and the N/2 packet size. The second table lists the cost for 8 nodes, 24 nodes, and 128 nodes.

The N/2 packet size is the packet size, in bytes, at which the interconnect reaches half of its peak bandwidth. It is important because it tells you whether small packets get good bandwidth performance. While latency is important for some codes, other codes depend heavily on bandwidth. Having good bandwidth for small packet sizes is good for these types of codes.
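If you have your own bandwidth sweep from a ping-pong style benchmark, N/2 can be estimated directly from the measurements. The sketch below is one simple way to do it, assuming the largest measured bandwidth stands in for the peak and using linear interpolation between neighboring points. The sample numbers are made up for illustration and are not taken from any vendor or from Table One.

```python
def n_half(samples):
    """Estimate N/2 from (message_size_bytes, bandwidth_MBps) measurements.

    Takes the highest measured bandwidth as the peak, then finds the smallest
    message size at which the curve first reaches half of that peak.
    """
    if not samples:
        raise ValueError("need at least one measurement")
    points = sorted(samples)
    target = max(bw for _, bw in points) / 2.0
    previous = None
    for size, bw in points:
        if bw >= target:
            if previous is None:
                return float(size)  # already at half peak at the smallest size
            size0, bw0 = previous
            # Linear interpolation between the two bracketing measurements.
            return size0 + (target - bw0) * (size - size0) / (bw - bw0)
        previous = (size, bw)
    raise ValueError("half of peak bandwidth was never reached")


if __name__ == "__main__":
    # Hypothetical sweep: (bytes, MBps)
    sweep = [(64, 50), (128, 100), (256, 200), (512, 350), (1024, 450), (4096, 500)]
    print(round(n_half(sweep), 1))
```

A smaller result means the interconnect delivers a useful fraction of its bandwidth to smaller messages.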

I have made every effort to provide accurate numbers. The information in Table One has either been provided by the vendors or obtained from references on the web.

I want to emphasize that Table One is designed to give you a "ball park" comparison of each technology. You will never be able to predict how well your application(s) work from examining the table. We recommend that you contact the vendors or integrators and discuss your needs with them. Table One can be considered a guide to your discussions. Indeed, the latency and bandwidth were not likely obtained in the exact same manner from each vendor. You should be careful in comparing them to one another.

In particular, using this table to select an interconnect is not a good idea. The table is intended as a 50,000 foot look at cluster interconnect technologies. The performance (and the devil) is in the details. The final selection of an interconnect should be done by testing your codes or a set of benchmark codes that you can correlate to your codes.


Table One - Interconnect Summary: Performance Metrics

Interconnect                                Latency (microseconds)  Bandwidth (MBps)  N/2 (Bytes)
GigE                                        ~29-120                 ~125              ~8,000
GigE: GAMMA                                 ~9.5 (MPI)              ~125              ~9,000
GigE with Jumbo Frames                      29-120                  ~125              ~8,000
GigE: Level 5                               15                      104.7             NA
10 GigE: Chelsio (Copper)                   9.6                     ~862              ~100,000+
Infiniband: Mellanox Infinihost (PCI-X)     4.1                     760               512
Infiniband: Mellanox Infinihost III EX SDR  2.6                     938               480
Infiniband: Mellanox Infinihost III EX DDR  2.25                    1502              480
Infinipath: HTX                             1.29                    954               214
Infinipath: PCI-Express                     1.62                    957.5             227
Myrinet D (gm)                              ~7.0                    ~493              ~1,000
Myrinet F (gm)                              ~5.2                    ~493              ~1,000
Myrinet E (gm)                              ~5.4                    ~493              ~1,000
Myrinet D (mx)                              3.5                     ~493              ~1,000
Myrinet F (mx)                              2.6                     ~493              ~1,000
Myrinet E (mx)                              2.7                     ~493              ~1,000
Myri-10G                                    2.0                     1,200             ~1,000
Quadrics                                    1.29                    ~875-910          ~576
Dolphin                                     4.2                     457.5             ~800


Table Two is a cost comparison for various cluster sizes and interconnect technologies. Aside from direct price comparisons, it was also created to illustrate how cost scales with cluster size. Please consult the foot notes for the assumptions. Finally, price is often related to performance (i.e. low price may equal low performance). Contact the companies listed in the article for more information about your HPC needs.

Table Two - Interconnect Summary: Pricing

Interconnect                    8 Node Cost     24 Node Cost    128 Node Cost
GigE [1]                        $258.00         $944.00         $27,328.00
GigE: GAMMA [2]                 $258.00         $944.00         $27,328.00
GigE with Jumbo Frames [3]      $308.00         $944.00         $27,328.00
GigE: Level 5 [4]               $4,060.00       $12,200.00      $83,360.00
10 GigE: Chelsio (Copper) [5]   $15,960.00      $62,280.00      $447,360.00
Infiniband: Voltaire [6]        $11,877.00      $23,084.00      $182,083.00
Infinipath [7]                  $13,810.00      $26,530.00      $207,860.00
Myrinet D (gm/mx) [8]           $7,200.00       $21,600.00      $115,200.00
Myrinet F (gm/mx) [9]           $8,000.00       $24,000.00      $128,000.00
Myrinet E (gm/mx) [10]          $12,000.00      $36,000.00      $192,000.00
Myri-10G [11]                   $9,600.00       $28,800.00      $153,600.00
Quadrics [12]                   $13,073.00      $43,698.00      $205,538.00
Dolphin [13]                    $7,800.00       NA              $140,160.00

Notes:

1 This assumes $26/NIC using the INTEL 32-bit PCI NICs, a basic 8-port GigE switch at $50, $320 for a 24-port switch (SMC GS16-Smart), and for 128-ports a Force10 switch that costs approximately $24,000.

2 This assumes $26/NIC using the INTEL 32-bit PCI NICs, a basic 8-port GigE switch at $50, $320 for a 24-port switch (SMC GS16-Smart), and for 128-ports a Force10 switch that costs approximately $24,000.

3 This assumes $26/NIC using the INTEL 32-bit PCI NICs, a SMC8508T 8-port GigE switch at $100, $320 for a 24-port switch (SMC GS16-Smart), and for 128-ports a Force10 switch that costs approximately $24,000.

4 This assumes $495/NIC (list price), a SMC8508T 8-port GigE switch at $100, $320 for a 24-port switch (SMC GS16-Smart), and for 128-ports a Force10 switch that costs approximately $24,000.

5 This assumes $795/NIC (list price). For an 8-port switch, I used an approximate price of $1,200/port (assuming a Fujitsu switch). For 24 ports, I assumed a price of $1,800/port using the new Quadrics 10 GigE switch. For 128-ports, I assumed a per port switch cost of $2,700 using the Force10 E1200 even though it's not a full line rate line card.

6 List prices obtained from Voltaire.

7 The list price for the Infinipath NIC is $795. It uses IB switches to connect the NICs. For 8-port and 24-port pricing, the Voltaire 9024 switch was used with an approximate list price of $7,450. For 128 ports, a Voltaire ISR 9288 switch was used with an approximate list price of $106,100.

8 Average about $900 a node (interconnect + NIC).

9 Average about $1,000 a node (interconnect + NIC).

10 Average about $1,500 a node (interconnect + NIC).

11 List prices for Myrinet 10G are not available at the time of this writing. The price per port is an approximation assuming about $800 per NIC and $400 per switch port.

12 List prices obtained from Quadrics on-line configurator.

13 List prices obtained from Dolphin.
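The arithmetic behind several of these notes is simply the number of nodes times the per-node cost, plus any fixed switch cost. A small sketch follows; the function name and parameter names are ours, and the prices plugged in are the ones quoted in notes 1, 5, and 7 above.

```python
def interconnect_cost(nodes, nic_price, switch_price=0, per_port_switch_price=0):
    """Total interconnect cost in dollars.

    nodes                  number of cluster nodes (one NIC each)
    nic_price              price of one NIC
    switch_price           fixed price of a whole switch, if priced that way
    per_port_switch_price  switch cost per port, if priced that way
    """
    if nodes < 0:
        raise ValueError("nodes must be non-negative")
    return nodes * (nic_price + per_port_switch_price) + switch_price


if __name__ == "__main__":
    # Note 1: GigE, $26 NICs plus a $50 8-port switch.
    print(interconnect_cost(8, 26, switch_price=50))
    # Note 5: 10 GigE, $795 NICs plus $1,800 per switch port at 24 nodes.
    print(interconnect_cost(24, 795, per_port_switch_price=1800))
    # Note 7: Infinipath, $795 NICs plus a $106,100 switch at 128 nodes.
    print(interconnect_cost(128, 795, switch_price=106100))
```

Running the same calculation with your own quotes is a quick way to see how the fixed switch cost comes to dominate at small node counts and the per-node cost at large ones.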


Jeff Layton has been a cluster enthusiast since 1997 and fights cluster crime in his spare time. As all good Cluster Monkeys, he has a cluster in his basement that he uses to perform evil experiments. He can be found swinging from the trees at ClusterMonkey.



    Creative Commons License
    ©2005-2012 Copyright Seagrove LLC, Some rights reserved. Except where otherwise noted, this site is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License. The Cluster Monkey Logo and Monkey Character are Trademarks of Seagrove LLC.