InfiniPath
InfiniPath is a new interconnect technology developed by PathScale. It
is a variation on InfiniBand (IB) that takes advantage of the HyperTransport
bus in Opteron systems and, soon, PCI-Express systems as well. In
addition to implementing standard InfiniBand protocols, this technology
enables accelerated software stacks that have very low latency and
high messaging rates.
PathScale was founded in 2001 by a team of experienced engineers
and scientists from Sun, SGI, and HP. In
April 2006, PathScale was acquired by QLogic, a well-known maker of
storage equipment such as Fibre Channel HBAs and switches.
The focus of PathScale is on
software and hardware solutions that enable Linux clusters to achieve
new levels of performance and efficiency. PathScale has two
flavors of InfiniPath adaptors: one that plugs into the AMD Opteron
HyperTransport bus, and one (shipping in late May) that plugs into
the PCI-Express bus. The NICs are then connected using standard IB switches
and cables.
This combination is unique for several reasons. First, it uses standard
IB switches, which reduces costs compared to developing a new switch.
Second, in the case of the NIC that plugs directly into the HyperTransport
bus, a factor of 2 to 3 times lower latency than other IB NICs can be
achieved. And third, as
you increase the speed of the Opteron processors, the latency of the
InfiniPath NIC will continue to decrease.
InfiniPath adaptors work with all InfiniBand switches currently
available. InfiniBand compliance is achieved by using the OpenFabrics
(formerly OpenIB) software stack. The InfiniPath device driver is part of
the kernel.org Linux kernel (as of 2.6.17). In addition to
OpenFabrics software, PathScale distributes an accelerated MPI stack
based on MPICH-1. This MPI stack, combined with the HyperTransport HTX
adaptor, achieves 1.29 microsecond latency and more than 8 million
messages/second on an 8-core system.
As mentioned, PathScale ships an accelerated MPI for InfiniPath. It is based on MPICH
1.2.6 with performance enhancements for InfiniPath. Other open-source MPI
implementations can support InfiniPath using the OpenFabrics software
stack. PathScale also has an IP over InfiniPath capability for its accelerated
stack. Thus any IP-based protocol
package should work (PVM or TCP/IP-based MPIs), albeit with reduced
performance.
As you will see in the performance table at the end of this article,
InfiniPath is the fastest performing interconnect available for clusters
today. It has the lowest latency and the smallest N/2 (defined later).
Consequently, its performance on many HPC codes is very good.
Myrinet
Myrinet was one of the first high-speed networks developed with
clustering in mind. Myricom is a privately held company started
in 1994 by Charles L. Seitz. Some of the initial technology was
developed under two ARPA-sponsored research projects. Myrinet became
an ANSI standard in 1998 and is now one of the most popular high-speed
cluster interconnects, being the dominant non-Ethernet interconnect on
the 26th Top500 list.
Myrinet is used by one of the fastest clusters in the world,
MareNostrum, in Spain.
Myricom has two product lines. Their existing product line, called
Myrinet-2000 or Myrinet-2G, has been in successful production use since
2000. Their new product line, called Myri-10G, started shipping at the
end of 2005. Myri-10G is based on 10-Gigabit Ethernet at the PHY layer
(layer 1), and is dual-protocol Ethernet and Myrinet at the Data Link
layer (layer 2).
Myrinet - 2G
Myricom currently has three NICs in the 2G line. All of them use fiber optic
connections. The 'D' card is the lowest price card with a 225
MHz RISC processor and a single fiber port. The newer 'F' card uses a
333 MHz RISC processor with a single fiber port.
The 'E' card has two NIC ports and uses a 333 MHz RISC processor.
All three NICs are 64-bit PCI-X based cards that are "short" cards
(not full length) and are low-profile.
Myrinet 2G is a switched network. The network is based on a Clos design
that uses small switch elements to build larger switches. Myrinet currently
uses 16-port switches as the basic building block.
Myrinet 2G network switches come in several sizes. Myricom has a 2U switch
chassis that can accommodate 8-port and 16-port models.
For medium-size networks, there is a 3U switch chassis for 32 ports,
a 5U switch chassis for up to 64 ports, and a 9U switch for up
to 128 ports. For larger networks there is a single
14U switch with up to 256 host ports. For even larger networks, you can
connect the switches using 'spine' cards, creating a Clos network that gives
full bisection bandwidth to each port. This gives a great deal of
flexibility when designing a network topology.
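
As a rough illustration of how a Clos network is assembled from small switch elements, the sketch below counts the 16-port elements needed for a two-level folded Clos with full bisection bandwidth. The function name, the two-level restriction, and the half-down/half-up port split are assumptions made for illustration; they do not describe the internals of any particular Myricom chassis.

```python
import math

def two_level_clos(hosts, radix=16):
    """Count switch elements for a two-level folded Clos with full bisection.

    Each leaf element uses half of its ports for hosts and half for uplinks,
    so the number of uplinks equals the number of host links.
    """
    if hosts < 1:
        raise ValueError("need at least one host")
    if hosts <= radix:
        # A single element is enough; no spine level is required.
        return {"leaves": 1, "spines": 0, "elements": 1}
    down = radix // 2                 # host ports per leaf element
    max_hosts = down * radix          # limit set by the spine port count
    if hosts > max_hosts:
        raise ValueError(f"two levels of {radix}-port elements support at most {max_hosts} hosts")
    leaves = math.ceil(hosts / down)
    uplinks = leaves * down           # one uplink for every host port
    spines = math.ceil(uplinks / radix)
    return {"leaves": leaves, "spines": spines, "elements": leaves + spines}

print(two_level_clos(128))  # {'leaves': 16, 'spines': 8, 'elements': 24}
```

Under these assumptions, two levels of 16-port elements top out at 128 host ports, which is why larger fabrics need an additional switching level.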
Since Myrinet is focused on clusters, it has taken advantage of the
fact that it is not tied to compatibility with general purpose networks
such as Ethernet, TCP, and IP. Consequently, it has made changes to improve
networking performance, specifically using a different protocol than TCP.
Current protocols for the Myrinet 2G line include GM and MX. They use
simpler packets than TCP, resulting in better usage of the packets (less
overhead). These packets can be of any length so they can contain
packets of other types, such as IP packets. In addition, the MX protocol
has been redesigned so that it has about half the zero-packet latency of
the GM protocol. To further improve performance, Myrinet also uses an
OS-bypass-style interface to lower latency and reduce CPU overhead.
The data packets on Myrinet are source routed, which means that each host
must know the route to all of the other hosts through the switch fabric.
The result is that the NICs do most of the work and the switches can be
very simple. The trade-off is that each NIC
must know the full network topology to route the data, and the topology must
be fairly static.
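
The idea of source routing can be sketched with a toy model: the sending host works out the full list of switch output ports in advance, and each switch simply consumes the next port number and forwards the packet. The fabric layout, host names, and function names below are hypothetical and are not taken from Myricom's software.

```python
from collections import deque

# Hypothetical fabric: switch name -> {output port: next hop (switch or host)}.
FABRIC = {
    "sw0": {0: "hostA", 1: "hostB", 2: "sw1"},
    "sw1": {0: "hostC", 1: "hostD", 2: "sw0"},
}
# Hypothetical attachment table: host -> the switch it plugs into.
ATTACHED = {"hostA": "sw0", "hostB": "sw0", "hostC": "sw1", "hostD": "sw1"}

def compute_route(src, dst):
    """Runs on the sending host: breadth-first search for the list of output ports."""
    start = ATTACHED[src]
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        switch, ports = queue.popleft()
        for port, nxt in FABRIC[switch].items():
            if nxt == dst:
                return ports + [port]
            if nxt in FABRIC and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, ports + [port]))
    raise ValueError(f"no route from {src} to {dst}")

def deliver(src, route):
    """Models the switches: each one reads the next port number and forwards."""
    node = ATTACHED[src]
    for port in route:
        node = FABRIC[node][port]
    return node

route = compute_route("hostA", "hostD")
print(route, deliver("hostA", route))  # [2, 1] hostD
```

Note that all of the path-finding logic lives in `compute_route` on the host side, while `deliver` does nothing but table lookups, which mirrors the division of labor described above.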
There are several MPIs available for Myrinet. Myricom uses an open-source
MPI called MPICH-GM for the GM protocol and MPICH-MX for the MX
protocol. Both are based on MPICH. Open MPI supports GM and will support
MX. LAM-MPI supports GM. Also, since Myrinet can
run TCP over GM, you can use any MPI that uses TCP with Myrinet. There is a
small performance penalty for running in this manner.
There are several commercial MPI implementations that support Myrinet
with GM. Verari System Software (previously MPI Software Technologies)
has MPI/Pro and Scali has Scali MPI Connect.
Myri-10G
The Myri-10G product is a new product line from Myricom. It takes
the bandwidth up to 10 Gbps, the same as 4X InfiniBand and 10 GigE.
However, the 10G NIC is different from most NICs because it can
take on different personalities depending upon which switch it is plugged
into. If you plug it into a Myricom switch, it will use MX as the
protocol. If you plug it into a 10 GigE switch, it will talk TCP.
The 10G NIC is PCI-Express only (sorry PCI-X world). Right now it
uses 8 lanes (x8 in PCI-Express talk). As with the 2G NICs, it
includes a processor and firmware. It does the network protocol
processing and also uses OS-bypass to improve performance.
The NICs are 10GBase-CX4, 10GBase-R, or XAUI over ribbon fiber. The
copper 10GBase-CX4 cables can go up to 15m in length and the 10GBase-R
serial fiber cables are good for 10 GigE.
Since the Myrinet design philosophy puts the intelligence in the
NIC, Myricom can use the same basic switch concepts for 10G. So there will
be Clos networks with full bisection bandwidth to each port.
Initially there will be 16-port and 128-port switches, with a 256-port
switch to follow in 1H 2006.
Quadrics
Quadrics is another company that has focused on very high performance
interconnects for clusters. The Quadrics R&D team grew out of the Meiko
supercomputer company. The company was incorporated in 1996 in Europe
as part of the Finmeccanica Group. It is headquartered in Britain with
regional offices in Italy and the US. Its networking technologies,
QsNet and the newest version, QsNetII, are focused on providing the
lowest latency possible with the largest bandwidth.
In many ways, Quadrics interconnect technology is similar to other
high-speed networking technologies but Quadrics is one of the best performing
cluster interconnects on the market. The intelligent Quadrics NICs use
an OS-bypass capability to help reduce latency. The NICs also use
RDMA concepts to reduce copying. This helps both latency and bandwidth.
Quadrics is currently selling two NICs and several switches. The QM400
NIC is a 64-bit 66 MHz PCI NIC and the QM500 is either a 64-bit or 128-bit
133 MHz PCI-X NIC for QsNetII. The QM500 is a new card that has faster on-board
processors and uses less power than the QM400. However, the QM400 card is
still one of the fastest cards for a simple PCI bus.
Quadrics is a switched network like most of the networks in this survey.
The basic topology of Quadrics (QsNetII) is to connect the NICs with
copper cables (up to 13m in length), or fiber cables for longer distances,
to 8-port switch chips that are arranged
in a fat-tree configuration. For QsNet, Quadrics has a 128-port switch and a
16-port switch. These switches can be linked together to form large
switching networks in a fat-tree topology so each port has full bisection
bandwidth. For QsNetII there is a range of stand-alone switches
(E-series: 8-port, 32-port, and 128-port) and an R-series network of federated
switches scalable up to 4,096 ports.
Quadrics provides an MPI implementation for QsNet and QsNetII. The MPI
is based on MPICH from Argonne Labs and is built on top of the low
latency libraries written by Quadrics. It also has shmem capability for
SMP nodes.
Verari System Software (previously MPI Software Technologies) has a
commercial MPI, ChaMPIon/Pro, that supports Quadrics. HP-MPI, Intel-MPI,
and Scali-MPI are also used on Quadrics-connected systems.
Like other vendors, Quadrics provides TCP/IP capability on their
networks allowing TCP based MPIs and PVM to work as well. The TCP/IP
layer has also been used for Lustre and the PVFS2 file systems. Finally,
Quadrics offers the Quadrics Resource Management System (RMS) which
provides a high level of parallel supercomputing facilities for UNIX and Linux
working on top of QsNet.
Dolphin
Dolphin Interconnect Solutions develops, markets, and supports SCI
(Scalable Coherent Interface) networking for clusters. Dolphin began
as a group of visionary engineers at the Norwegian computer maker
Norsk Data. Together with engineers from a number of US computer
makers, they developed the SCI standard, which
was made an official IEEE/ANSI standard in the spring of 1992.
Dolphin Interconnect Solutions was formed in the autumn of 1991 to
focus on SCI markets and products. During the period from 1992 to
1996, the core SCI technology was developed and clusters using SCI
began to appear. Dolphin Interconnect Solutions is based in Oslo, Norway
with sales representation around the world. Product and technology
development is conducted by the technical staff in Oslo, in close
collaboration with a network of partners, consultants, universities
and research institutes.
Dolphin ships single, dual, and triple-port SCI cards in PCI form factors
and a dual-port PCI-Express card. SCI is a switchless network. The NICs
are connected to one another in some network topology. For example,
you can make a simple ring topology, a 2D torus topology, or even
a 3D topology (i.e., by using multiple ports on the NICs, you create
one-dimensional, two-dimensional, and three-dimensional topologies). With
a switched network, you have to worry about how many open ports you
have. If you need more, you have to buy another switch or a bigger
switch. Expanding an existing SCI system usually just involves
plugging the new NIC into the network in the appropriate fashion.
However, a downside is that if there is a downed node in the network,
traffic must be routed around it, thus impacting messaging in the
remaining system. On the other hand, it avoids the single point of failure
of typical switched networks.
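
To make the ring and torus layouts concrete, the sketch below lists the nodes that sit one hop away from a given node, with one ring per dimension. It is only an illustration: the function name is made up, and it treats every ring as bidirectional for simplicity, which is an assumption and may not match how the SCI hardware actually moves packets around a ring.

```python
def torus_neighbors(coord, dims):
    """Return the nodes one hop from `coord` in a torus of shape `dims`.

    Each dimension is a ring, so coordinates wrap around at the edges.
    """
    neighbors = set()
    for axis, size in enumerate(dims):
        for step in (-1, 1):
            nxt = list(coord)
            nxt[axis] = (coord[axis] + step) % size
            if tuple(nxt) != tuple(coord):   # a ring of one node has no neighbor
                neighbors.add(tuple(nxt))
    return sorted(neighbors)

print(torus_neighbors((0,), (8,)))       # simple ring of 8 nodes: [(1,), (7,)]
print(torus_neighbors((0, 0), (4, 4)))   # 4x4 torus: [(0, 1), (0, 3), (1, 0), (3, 0)]
```

Each extra dimension adds one more ring through every node, which is why the 2D and 3D layouts call for the dual-port and triple-port cards.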
The chips on the NICs handle all of the routing so that a packet, if it
is not intended for that node, has no impact on the CPU at all.
Consequently, the NICs are very efficient at routing packets to the
neighboring nodes.
The flexibility of network design is a big plus for SCI. To design an
SCI network, however, you must choose the layout carefully to avoid saturation.
Typically, 8-10 nodes will saturate a simple SCI ring, 64-100 nodes will
saturate a 2D SCI torus, and 640-1000 nodes will saturate a 3D SCI torus.
There are several MPI implementations available for SCI. There is an
open-source version based on MPI 1.2 from MPICH called SCI-MPICH. There
is also an open-source MPI based on MPICH2 called SCI-MPICH2.
Dolphin offers an open-source package called Dolphin SuperSockets. It
provides the necessary transparent glue to boost any available MPI or PVM
implementation. It has been tested with MPICH2, LAM-MPI, PVM, and even
Lustre, GFS, and iSCSI.
Dolphin also sells a package called Dolphin Sockets. It provides the
necessary transparent glue underneath MPI and PVM implementations. It has been
tested with MPICH2, LAM-MPI, PVM, and even Lustre, GFS, and iSCSI. Scali
also supplies a commercial MPI implementation that works with SCI.
There Is More, But
There are other network technologies we did not mention; we decided to
cover the "mainstream" products and technologies that are used by
the HPC cluster market. Please contact us if you think there
is something we missed.
Summary
It's always nice to have a table of data to compare various things.
I've created two tables to compare the various interconnects I've talked
about. The first table lists latency in microseconds,
bandwidth in megabytes per second (MBps), and the N/2 packet size.
The second table lists the cost for 8 nodes, 24 nodes, and 128 nodes.
The N/2 packet size is the packet size in bytes at which the interconnect
reaches half of its peak bandwidth. It is important because it
tells you whether small packets get good bandwidth performance. While latency
is important for some codes, other codes depend heavily on bandwidth.
Having good bandwidth for small packet sizes is good for these types
of codes.
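
One way to read the N/2 definition is as a lookup on a measured bandwidth curve. The sketch below estimates N/2 from a list of (packet size, bandwidth) measurements by interpolating linearly between the two points that bracket half of the peak. The function name and the sample numbers are hypothetical; they are not vendor data and not the method used to produce the tables in this article.

```python
def n_half(sizes, bandwidths):
    """Estimate N/2: the packet size at which half of peak bandwidth is reached.

    `sizes` are packet sizes in bytes (ascending); `bandwidths` are the
    bandwidths measured at those sizes, in any consistent unit.
    """
    if not sizes or len(sizes) != len(bandwidths):
        raise ValueError("need matching, non-empty measurement lists")
    target = max(bandwidths) / 2.0
    for i, bw in enumerate(bandwidths):
        if bw >= target:
            if i == 0:
                return float(sizes[0])   # already at half of peak at the smallest size
            # Linear interpolation between the two bracketing measurements.
            s0, s1 = sizes[i - 1], sizes[i]
            b0, b1 = bandwidths[i - 1], bandwidths[i]
            return s0 + (target - b0) * (s1 - s0) / (b1 - b0)

# Hypothetical measurements: packet size in bytes, bandwidth in MBps.
sizes = [64, 128, 256, 512, 1024, 4096, 65536]
bandwidths = [50, 100, 200, 400, 600, 750, 800]
print(n_half(sizes, bandwidths))  # 512.0
```

A smaller result means the interconnect gets close to its peak bandwidth with smaller packets.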
I have made every effort to provide accurate numbers. The information
in Table One has either been provided by the vendors or obtained from
references on the web.
I want to emphasize that Table One is designed to give you a "ball park"
comparison of each technology. You will never be able to predict how
well your application(s) will work from examining the table. We recommend
that you contact the vendors or integrators and discuss your needs
with them. Table One can be considered a guide to your discussions.
Indeed, the latency and bandwidth were not likely obtained in the exact
same manner from each vendor. You should be careful in comparing them
to one another.
In particular, using this table to select an interconnect is not a good idea.
The table is intended as a 50,000 foot look at cluster interconnect
technologies. The performance (and the devil) is in the details. The final
selection of an interconnect should be done by testing your codes or a set
of benchmark codes that you can correlate to your codes.
Table One - Interconnect Summary: Performance Metrics
| Interconnect | Latency (microseconds) | Bandwidth (MBps) | N/2 (Bytes) |
| --- | --- | --- | --- |
| GigE | ~29-120 | ~125 | ~8,000 |
| GigE: GAMMA | ~9.5 (MPI) | ~125 | ~9,000 |
| GigE with Jumbo Frames | 29-120 | ~125 | ~8,000 |
| GigE: Level 5 | 15 | 104.7 | NA |
| 10 GigE: Chelsio (Copper) | 9.6 | ~862 | ~100,000+ |
| InfiniBand: Mellanox InfiniHost (PCI-X) | 4.1 | 760 | 512 |
| InfiniBand: Mellanox InfiniHost III EX SDR | 2.6 | 938 | 480 |
| InfiniBand: Mellanox InfiniHost III EX DDR | 2.25 | 1502 | 480 |
| InfiniPath: HTX | 1.29 | 954 | 214 |
| InfiniPath: PCI-Express | 1.62 | 957.5 | 227 |
| Myrinet D (gm) | ~7.0 | ~493 | ~1,000 |
| Myrinet F (gm) | ~5.2 | ~493 | ~1,000 |
| Myrinet E (gm) | ~5.4 | ~493 | ~1,000 |
| Myrinet D (mx) | 3.5 | ~493 | ~1,000 |
| Myrinet F (mx) | 2.6 | ~493 | ~1,000 |
| Myrinet E (mx) | 2.7 | ~493 | ~1,000 |
| Myri-10G | 2.0 | 1,200 | ~1,000 |
| Quadrics | 1.29 | ~875-910 | ~576 |
| Dolphin | 4.2 | 457.5 | ~800 |
Table Two is a cost comparison for various cluster sizes and interconnect
technologies. Aside from direct price comparisons, it was also created to
illustrate how cost scales with cluster size. Please consult the footnotes
for the assumptions. Finally, price is often related to performance (i.e.,
low price may equal low performance). Contact the companies listed in the
article for more information about your HPC needs.
Table Two - Interconnect Summary: Pricing
| Interconnect | 8 Node Cost | 24 Node Cost | 128 Node Cost |
| --- | --- | --- | --- |
| GigE [1] | $258.00 | $944.00 | $27,328.00 |
| GigE: GAMMA [2] | $258.00 | $944.00 | $27,328.00 |
| GigE with Jumbo Frames [3] | $308.00 | $944.00 | $27,328.00 |
| GigE: Level 5 [4] | $4,060.00 | $12,200.00 | $83,360.00 |
| 10 GigE: Chelsio (Copper) [5] | $15,960.00 | $62,280.00 | $447,360.00 |
| InfiniBand: Voltaire [6] | $11,877.00 | $23,084.00 | $182,083.00 |
| InfiniPath [7] | $13,810.00 | $26,530.00 | $207,860.00 |
| Myrinet D (gm/mx) [8] | $7,200.00 | $21,600.00 | $115,200.00 |
| Myrinet F (gm/mx) [9] | $8,000.00 | $24,000.00 | $128,000.00 |
| Myrinet E (gm/mx) [10] | $12,000.00 | $36,000.00 | $192,000.00 |
| Myri-10G [11] | $9,600.00 | $28,800.00 | $153,600.00 |
| Quadrics [12] | $13,073.00 | $43,698.00 | $205,538.00 |
| Dolphin [13] | $7,800.00 | NA | $140,160.00 |
Notes:
1 This assumes $26/NIC using the INTEL 32-bit PCI NICs,
a basic 8-port GigE switch at $50, $320 for a 24-port switch
(SMC GS16-Smart), and for 128-ports a Force10 switch that costs
approximately $24,000.
2 This assumes $26/NIC using the INTEL 32-bit PCI NICs,
a basic 8-port GigE switch at $50, $320 for a 24-port switch
(SMC GS16-Smart), and for 128-ports a Force10 switch that costs
approximately $24,000.
3 This assumes $26/NIC using the INTEL 32-bit PCI NICs,
a SMC8508T 8-port GigE switch at $100, $320 for a 24-port switch
(SMC GS16-Smart), and for 128-ports a Force10 switch that costs
approximately $24,000.
4 This assumes $495/NIC (list price),
a SMC8508T 8-port GigE switch at $100, $320 for a 24-port switch
(SMC GS16-Smart), and for 128-ports a Force10 switch that costs
approximately $24,000.
5 This assumes $795/NIC (list price). For an 8-port switch,
I used an approximate price of $1,200/port (assuming a Fujitsu
switch). For 24 ports, I assumed a price of $1,800/port using the
new Quadrics 10 GigE switch. For 128-ports, I assumed a per
port switch cost of $2,700 using the Force10 E1200 even though
it's not a full line rate line card.
6 List prices obtained from Voltaire.
7 The list price for the InfiniPath NIC is
$795. It uses IB switches to connect the NICs. For 8-port
and 24-port pricing, the Voltaire 9024 switch was used with an
approximate list price of $7,450. For 128 ports, a Voltaire
ISR 9288 switch was used with an approximate list price of
$106,100.
8 Average about $900 a node (interconnect + NIC).
9 Average about $1,000 a node (interconnect + NIC).
10 Average about $1,500 a node (interconnect + NIC).
11 List prices for Myrinet 10G are not available at the
time of this writing. The price per port is an approximation
assuming about $800 per NIC and $400 per switch port.
12 List prices obtained from Quadrics on-line configurator.
13 List prices obtained from Dolphin.
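
The arithmetic behind several rows of Table Two follows directly from the notes: one NIC per node, plus either a fixed-price switch or a per-port switch charge. The short sketch below recomputes the GigE, 10 GigE, and InfiniPath rows from notes 1, 5, and 7. The function name and parameters are made up for illustration, and the other rows may rest on assumptions that this simple formula does not capture.

```python
def interconnect_cost(nodes, nic_price, switch_price=0, per_port_price=0):
    """One NIC per node, plus a fixed-price switch and/or a per-port switch charge."""
    return nodes * (nic_price + per_port_price) + switch_price

# Note 1 (GigE): $26 NICs; $50, $320, and $24,000 switches.
gige = [interconnect_cost(8, 26, switch_price=50),
        interconnect_cost(24, 26, switch_price=320),
        interconnect_cost(128, 26, switch_price=24_000)]

# Note 5 (10 GigE): $795 NICs; $1,200, $1,800, and $2,700 per switch port.
ten_gige = [interconnect_cost(8, 795, per_port_price=1_200),
            interconnect_cost(24, 795, per_port_price=1_800),
            interconnect_cost(128, 795, per_port_price=2_700)]

# Note 7 (InfiniPath): $795 NICs; $7,450 and $106,100 switches.
infinipath = [interconnect_cost(8, 795, switch_price=7_450),
              interconnect_cost(24, 795, switch_price=7_450),
              interconnect_cost(128, 795, switch_price=106_100)]

for name, costs in (("GigE", gige), ("10 GigE", ten_gige), ("InfiniPath", infinipath)):
    per_node = [round(cost / n) for cost, n in zip(costs, (8, 24, 128))]
    print(f"{name:<11} total={costs} per-node={per_node}")
```

Printing the per-node figure alongside the total makes it easier to see how the switch cost, rather than the NIC cost, drives the price at larger cluster sizes.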
Jeff Layton has been a cluster enthusiast since 1997 and fights
cluster crime in his spare time. As all good Cluster Monkeys, he has
a cluster in his basement that he uses to perform evil experiments.
He can be found swinging from the trees at ClusterMonkey.