Switching the Ether
GigE switches have now become less expensive. Switches can be
purchased in two varieties: managed and unmanaged. Small switches
are almost always unmanaged switches. That is, they are
plug-and-play with no option to provide any type of configuration.
A small
5-port switch
starts at about $32 with an
8-port switch
starting at about $52, and a
16-port switch
starting at $233. These switches typically do not support jumbo frames (MTU
up to 9000).
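To get a feel for why jumbo frames matter, here is a small sketch that estimates how much of the wire time carries TCP payload for a given MTU. It assumes standard Ethernet framing overhead and IPv4 and TCP headers without options; the function name is made up for illustration.

```python
# Per-frame overhead on the wire, in bytes (standard Ethernet framing).
ETH_OVERHEAD = 8 + 14 + 4 + 12   # preamble+SFD, MAC header, FCS, inter-frame gap
TCPIP_HEADERS = 20 + 20          # IPv4 + TCP headers, assuming no options


def tcp_payload_efficiency(mtu: int) -> float:
    """Fraction of wire time that carries TCP payload for full-size frames."""
    payload = mtu - TCPIP_HEADERS   # bytes of application data per frame
    on_wire = mtu + ETH_OVERHEAD    # bytes of wire time consumed per frame
    return payload / on_wire


if __name__ == "__main__":
    for mtu in (1500, 9000):
        print(f"MTU {mtu}: {tcp_payload_efficiency(mtu):.1%} payload efficiency")
```

Under these assumptions the efficiency goes from roughly 95% at an MTU of 1500 to roughly 99% at an MTU of 9000. The larger practical benefit is usually that the host handles about six times fewer frames for the same amount of data.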
SMC has been selling a very
inexpensive line of unmanaged small switches for several years.
While these switches are very good in their own right, one of
the best features of these switches is that they are capable of
jumbo frames. (Note: The prices mentioned below were determined
at the time the article was posted.)
The
SMC 8505T
5-port switch costs about
$70, the
SMC 8508T
8-port switch costs about
$100,
and a 16-port version (8516T) starts at about
$300.
There is also a 24-port version of this new switch called the SMC 8524T that can be
found for about
$400.
The 8516T and 8524T switches are becoming more difficult to
find but there is a new line of 16 and 24 port unmanaged
switches that are capable of jumbo frames. The
SMC GS16-Smart
16-port switch costs about
$240.
A 24-port version of this switch, the
SMC GS24-Smart
can be found for about
$300.
There are a number of switch manufacturers that target the
cluster market by providing large high density/performance
managed switches. Managed switches allow users to monitor
performance or change settings. Companies such as
Foundry,
Force10,
Extreme Networks,
HP,
SMC,
and Cisco offer
large GigE switching solutions.
For clusters with Ethernet, having a single high performance backplane
(i.e. one large switch) can be important for good performance. A single
backplane can provide good "cross-sectional" bandwidth where all nodes
can achieve maximum throughput (if the switch is capable of line-rate
data transmission on all ports simultaneously). An alternative to a single backplane
is to cascade smaller and usually less expensive switches to provide
the requisite number of ports for your cluster. This is usually done
in a common topology such as a fat-tree topology. Depending upon the
topology, cascading switches can reduce the effective cross-sectional
bandwidth and almost always increases the latency. Several companies have
examples of large single backplane switches. Foundry has a switch
(BigIron RX-16) that
can have up to 768 GigE ports in a single switch. Extreme Networks
has a 480-port GigE switch
(BlackDiamond 10808)
and Force10 has a monster 1260-port GigE switch
(Terascale E-series).
There are some interesting topologies you can use with Ethernet to
get higher performance and lower cost networks. The two most prominent
ones are
FNN (Flat Neighborhood Networks),
and SFNN (Sparse Flat
Neighborhood Networks).
An FNN is a network topology that guarantees single-switch latency
and full bandwidth between pairs of processing elements for a variety of
communication patterns. This is achieved by giving each node
more than one NIC and plugging those NICs into smaller switches
in such a way that there is only a single switch between certain
pairs of nodes. If you know how your code communicates, you can
then use this knowledge to design a low-cost, high performance
FNN that uses inexpensive small switches to achieve the same
performance that you get from larger, much more expensive switches.
The FNN website even has an on-line tool for designing FNNs.
Sparse FNNs are a variation of FNNs that allow you to select which
communication patterns you want to ensure have single switch latency
and full bandwidth. They allow you to design a network for very specific
codes to get the best performance at the lowest cost.
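The FNN idea can be sketched in a few lines. The example below is not the on-line design tool mentioned above. It is only a checker, with hypothetical node and switch names, that takes a proposed wiring (which switches each node's NICs plug into) and reports any node pairs that do not share a switch.

```python
from collections import Counter
from itertools import combinations


def pairs_without_shared_switch(wiring):
    """Return the node pairs that have no switch in common.

    `wiring` maps a node name to the set of switches its NICs plug into.
    An empty result means every pair of nodes is one switch apart.
    """
    return [
        (a, b)
        for a, b in combinations(sorted(wiring), 2)
        if not wiring[a] & wiring[b]
    ]


def ports_per_switch(wiring):
    """Count how many switch ports each switch needs for this wiring."""
    return Counter(sw for switches in wiring.values() for sw in switches)


# Hypothetical example: 6 nodes, 2 NICs per node, three 4-port switches.
wiring = {
    "n0": {"A", "B"}, "n1": {"A", "B"},
    "n2": {"A", "C"}, "n3": {"A", "C"},
    "n4": {"B", "C"}, "n5": {"B", "C"},
}

if __name__ == "__main__":
    print("uncovered pairs:", pairs_without_shared_switch(wiring))
    print("ports used:", dict(ports_per_switch(wiring)))
```

In this made-up layout, six nodes are fully connected with single-switch latency using only three 4-port switches. A sparse FNN would run the same kind of check only over the pairs that your code actually uses.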
GigE Message-Passing
Communication over GigE is usually done by using the kernel TCP
services. Because TCP is such a widely supported standard, a large number of MPI
implementations are available for GigE. Virtually every MPI
implementation available, either open-source or commercial, supports
TCP. There are a number of open-source implementations such as
MPICH,
MPICH2,
LAM-MPI,
Open MPI,
FT-MPI,
LA-MPI,
PACX-MPI,
MVICH,
GAMMA,
OOMPI (C++ only),
MP-MPICH,
and MP_Lite (useful subset of MPI).
There are also some major commercial MPI vendors. Verari Systems
Software (previously MPI Software Technologies) has
MPI/Pro and
Scali has
Scali MPI Connect,
Critical Software has
WMPI,
HP has HP-MPI,
and Intel
has Intel-MPI.
Also, since this is just plain TCP, PVM and other
message passing schemes may also be used over GigE. In addition, many
storage protocols support TCP and in some cases native Ethernet.
For example,
iSCSI,
HyperSCSI,
Lustre,
ATA over Ethernet,
PVFS,
and GPFS
can all be run over GigE.
Beyond GigE: 10 Gigabit Ethernet
It was quickly realized that even GigE would not give enough
throughput for our data-hungry world. Consequently, the development
of the next level, 10 Giga-bits per second, or
10 GigE,
was started. One of the key tenets of the development was to retain
backwards compatibility with previous Ethernet standards. This
new IEEE standard is 802.3ae.
It was also realized that there would have to be some changes to the
existing way Ethernet functioned to get the required performance.
For example, the IEEE standard has altered the way the MAC layer
interprets signaling. Now the signals are interpreted in parallel
to speed up the processing. However, nothing has been done to the
standard to limit backward compatibility.
For most 10 GigE installations, fiber optic cables are used to maintain
a 10 Gbps speed. However, copper is becoming increasingly popular.
There are NICs and switches that support 10GBASE-CX4, which runs over
copper CX4 cables with Infiniband 4x connectors.
These cables are currently limited to 15m in length.
However, there are new 10 GigE cables being developed. In approximately
August 2006, there should be 10GBASE-T cables available. They use
unshielded twisted-pair wiring, the same as cat-5e or cat-6 cables.
This is the proverbial "Holy Grail" of 10 GigE. Having just the cables
will be somewhat pointless though if the NICs and the switches are not
available as well. So we should see a flurry of new product announcements
during 2006.
10 GigE NIC Vendors
There are several vendors currently developing and selling 10 GigE
NICs for clusters. Currently there are
Chelsio and
Neterion,
formerly S2IO,
Myricom, and
Intel
providing 10 GigE NICs.
Chelsio markets two PCI-X
intelligent 10 GigE NICs that include RDMA support that are reasonable
to use for the HPC market. The
T210 uses fiber
optic cables and comes in both an SR and an LR version (multi-mode fiber
and single-mode fiber, respectively) for 64-bit PCI-X.
This NIC includes both a TOE capability (TCP Off-Load Engine) and a
RDMA (Remote Direct Memory Access) capability to improve the performance
of the NIC as much as possible. Since the NIC uses TCP as the network
protocol, any MPI implementation that uses TCP, which virtually all of
them do, will work with these NICs without change. Moreover, the Chelsio driver was
the first 10 GigE driver to be included in the Linux kernel.
Chelsio also has a copper connector version, the
T210-CX. It is a
64-bit PCI-X NIC that uses CX4 copper cables. It has the same features
of the T210 including TOE and RDMA capability.
Chelsio also sells a "dumb" 10 GigE NIC, the
N210,
that does not include RDMA or TOE capabilities. It uses fiber optic
cables, both SR and LR, for connecting to switches or to other NICs.
It is cheaper than the T210 series, but likely has the same level of
performance for HPC codes.
Neterion has a 10 GigE NIC, called the
Xframe. The NIC
is a 64-bit PCI-X NIC, has RDMA and some TOE capability, and uses
fiber optic cabling.
Neterion has also announced a new 10 GigE NIC called
Xframe II.
This NIC has a 64-bit PCI-X 2.0 interface that should allow a bus
speed of 266 MHz instead of the usual 133 MHz.
According to the company this NIC should be capable of hitting 10 Gbps
wire-speed (PCI-X currently limits 10 GigE NICs to about
6.5-7.5 Gbps) and achieve a 14 Gbps bi-directional bandwidth.
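The bus limit is easy to estimate. The sketch below computes the theoretical peak of a parallel bus as width times clock, using the bus figures quoted above; the function name is made up, and the result is a ceiling that ignores protocol overhead and bus sharing.

```python
def bus_peak_gbps(width_bits: int, clock_mhz: float) -> float:
    """Theoretical peak of a parallel bus: one transfer of `width_bits` per clock."""
    return width_bits * clock_mhz * 1e6 / 1e9


if __name__ == "__main__":
    for name, clock_mhz in (("PCI-X 133 MHz", 133), ("PCI-X 2.0 266 MHz", 266)):
        print(f"64-bit {name}: {bus_peak_gbps(64, clock_mhz):.1f} Gbps peak")
```

A 64-bit bus at 133 MHz tops out at about 8.5 Gbps before any overhead, which is consistent with the 6.5-7.5 Gbps seen in practice and explains why a 10 GigE NIC cannot reach wire speed on that bus. Doubling the clock to 266 MHz lifts the ceiling to about 17 Gbps.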
Currently both the Xframe and Xframe II NICs use optical fiber
connectors, presumably both SR and LR. However, with the desire for
copper connectors, it is entirely possible they have a CX4 version
of the NIC. Neither NIC is sold directly to the public; both are
sold to OEMs. Recently IBM announced that they will use Neterion
NICs in their xSeries servers that use Intel processors.
Myricom has a new NIC that has some interesting features. It can function
as a normal 10 GigE NIC if it's plugged into an Ethernet switch. Or it can
function as a Myrinet NIC using the MX protocol when plugged into a
Myrinet switch (Holy Sybil Batman!). This new NIC, called the
Myri-10G, can
accommodate a number of connectors including 10GBase-CX4,
10GBase-{S|L}R, and XAUI/CX4 over ribbon fiber. The cards are PCI-Express
x8 and start at $795 a NIC (list price). Myricom has Linux, Windows,
and Solaris drivers that come bundled with the card and FreeBSD has
a driver for it in its source tree. Myricom reports that they have
been shipping the NICs since Dec. 2005.
Intel has developed and is selling
three 10 GigE NICs: the Intel
Pro/10GbE CX4 NIC, Intel Pro/10GbE SR NIC, and the Intel Pro/10 GbE LR
adapter. The CX4 version is a PCI-X NIC that uses copper cabling.
When this article was written, the best pricing for it was about
$872.
The SR version is also a PCI-X NIC that uses multi-mode fiber cables
for connectivity. It is intended primarily for connecting enterprise
systems and not for HPC. Currently it costs about
$3,000
per NIC. Finally, the LR version of the NIC, which is also a PCI-X
NIC, is for long-range connectivity (up to 10 km) using single-mode
fiber cables. As with the SR NIC, it is not really designed for
HPC and its price is about
$5,000.
It is likely that other vendors will be developing 10 GigE products
in the near future. Level 5 and other RDMA GigE manufacturers are
rumored to be developing a 10 GigE product.
10 GigE MPI
Since 10 GigE is still Ethernet and TCP, you can use just about
any MPI implementation, commercial or open-source, as long as it
supports TCP or Ethernet. This means that you can run existing binaries
without any source changes. This is one reason why people are seriously
considering 10 GigE as the upcoming interconnect for HPC.
10 GigE Switches
There are several 10 GigE switch manufacturers. The typical HPC switch
vendors such as Foundry, Force10, and Extreme all make 10 GigE
line cards for their existing switch chassis. They have been developing
these line cards primarily for the enterprise market, but they now
realize that as the costs come down on the line cards and the NICs,
they may have a product line suitable for the HPC market.
Foundry has a new large chassis
(14U) switch, called the
RX-16.
It can accommodate up to 16 line cards. Foundry currently has a 4-port
10 GigE line card. They have a fiber optic version of this line card and,
presumably, a copper version using CX4 cables. All ports on the switch
run at full line rate and have an approximate per port cost of $4,000.
A company that has focused on 10 GigE for some time is
Extreme Networks.
They have a
BlackDiamond
series of switches that focus on high performance, including HPC.
Their largest switch, the
BlackDiamond 10808,
can accommodate up to 48 ports of 10 GigE, presumably both fiber and
copper.
Force10 has probably
been the leader in the 10 GigE market. They have a large single chassis
switch, called the E Series
that can accommodate up to 224 10 GigE ports. To reach this port count,
they have a new line
card that can accommodate up to 16 ports of 10 GigE (a total of 14 line
cards). However, these new line cards are not
full line rate cards.
To get full line rate on each port, they have a 4-port 10 GigE line
card, resulting in 56 total 10 GigE ports. An interesting difference
between the cards, besides the performance, is the price. The 16-port
line cards result in a per port cost of about $2,700, while the 4-port
line cards result in a per port cost of about $7,500. Both line cards
have pluggable XFP optics allowing SR, LR, ER, and ZR optics to be used.
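The trade-off between the two line cards is simple arithmetic, sketched below with the port counts and per-port prices quoted above. The class and function names are made up, and the totals cover line-card ports only, since no chassis price is given.

```python
from dataclasses import dataclass

CHASSIS_SLOTS = 14  # line-card slots, as quoted above


@dataclass(frozen=True)
class LineCard:
    ports: int
    cost_per_port_usd: int
    full_line_rate: bool


def chassis_summary(card: LineCard, slots: int = CHASSIS_SLOTS):
    """Return (total ports, total port cost in USD) for a fully populated chassis."""
    total_ports = card.ports * slots
    return total_ports, total_ports * card.cost_per_port_usd


dense = LineCard(ports=16, cost_per_port_usd=2_700, full_line_rate=False)
line_rate = LineCard(ports=4, cost_per_port_usd=7_500, full_line_rate=True)

if __name__ == "__main__":
    for card in (dense, line_rate):
        ports, cost = chassis_summary(card)
        print(f"{card.ports}-port card: {ports} ports, ${cost:,} in ports, "
              f"full line rate: {card.full_line_rate}")
```

A full chassis of the dense cards gives four times the ports at well under half the per-port price, in exchange for giving up full line rate on every port.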
The switches from Extreme, Force10, and Foundry focus on the high end of
10 GigE with large port counts. However, other companies are focusing on
lower port count 10 GigE switches that deliver good performance.
Fujitsu has a single-chip
10 GigE switch called the
XG700
that has 12 ports in a compact form factor. Also, the switch can be configured
for SR, LR, and CX4 connections. It has a very low latency of 450 ns and
has a reasonably low cost of about $1,200 per port.
SMC has long been a favorite of cluster
builders for small to medium clusters. Their switches have very good
performance and they have a wide range of unmanaged and managed
switches. Recently they brought out an inexpensive 8-port switch, the
SMC 8708L2.
It is an 8-port single backplane switch that uses XFP connectors that
support SR, LR, and ER XFP.
It is a managed switch, but at press time one of these
switches was about $6,300.
That comes out to less than $800/port. This is the price/performance leader
for small 10 GigE fiber switches.
Quadrics
introduced a new 10 GigE switch that uses the Fujitsu single-chip
10 GigE solution. At the Supercomputing 2005 (SC05) show, they were showing a
new 10 GigE switch
that fits into an 8U chassis. It has 12 slots for 10 GigE line cards.
Each line card has 8 ports for 10 GigE connections using CX4 connectors.
The remaining four ports for each line card are used to internally
connect the line cards in a fat tree configuration. This means that
the network is 2:1 oversubscribed but looks to have very good performance.
If all line card slots are populated, then the switch can have 96 ports.
It has been in testing since Q1 of 2006 and full production is slated
for mid 2006. No prices have been announced,
but the rumors are that the price should be below $2,000 a port.
Quadrics also stated in their press release that follow-on products
will increase the port count to 160 and then to 1,600. Further
announcements on this product will be at Interop.
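The port count and the 2:1 figure for the Quadrics chassis follow directly from how the ports on each line card are split. Here is a short sketch, assuming each line card is built around one 12-port switch chip as described above; the function name is made up.

```python
def line_card_chassis(slots: int, chip_ports: int, external_ports: int):
    """Return (external port count, oversubscription ratio) for a chassis.

    Each line card exposes `external_ports` to hosts and uses the remaining
    chip ports as internal uplinks into the fat tree.
    """
    internal_ports = chip_ports - external_ports
    if internal_ports <= 0:
        raise ValueError("line card needs at least one internal uplink")
    return slots * external_ports, external_ports / internal_ports


if __name__ == "__main__":
    ports, ratio = line_card_chassis(slots=12, chip_ports=12, external_ports=8)
    print(f"{ports} external ports, {ratio:g}:1 oversubscribed")
```

With 12 slots, 8 external ports, and 4 internal ports per card, this gives 96 external ports with 2:1 oversubscription. A 6/6 split of the same 12-port chip would give full bandwidth but only 72 external ports.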
Even more exciting is a new company,
Fulcrum Micro, that is
developing a new
10 GigE switch ASIC.
It has great performance with a
latency of about 200 ns and uses cut-through rather than store-and-forward
for improved latency and throughput. It can accommodate up to 24 ports
and should be available in Jan. 2006 for about $20/port. Fulcrum has a
paper that talks about how to take the 24-port 10 GigE switches that use
their ASIC and construct a 288-port fat-tree topology with full bandwidth
to each port and a latency of only 400 ns. According to Fulcrum, a number
of companies are looking at using their ASICs to build HPC-centric
10 GigE switches.
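The 288-port figure matches what you get from a two-level, full-bandwidth fat tree built from identical switches. The sketch below is the generic construction, not taken from the Fulcrum paper: each leaf switch uses half its ports for hosts and half for uplinks, one to each spine switch.

```python
def two_level_fat_tree(radix: int) -> dict:
    """Size a full-bandwidth two-level fat tree built from `radix`-port switches."""
    if radix < 2 or radix % 2:
        raise ValueError("radix must be a positive even number")
    down_ports = radix // 2   # host-facing ports on each leaf switch
    spines = radix // 2       # one uplink from every leaf to every spine
    leaves = radix            # each spine port connects to a different leaf
    return {
        "host_ports": leaves * down_ports,
        "leaf_switches": leaves,
        "spine_switches": spines,
        "total_switches": leaves + spines,
    }


if __name__ == "__main__":
    print(two_level_fat_tree(24))
```

For 24-port switches this gives 24 leaf switches and 12 spine switches, or 36 switches in total, serving 288 host ports.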
Infiniband
Infiniband
was created as an open standard to support a high-performance
I/O architecture that is scalable, reliable and efficient. It was
created in 1999 by the merging of two projects: Future I/O supported
by Compaq, IBM, and HP, and Next Generation I/O supported by Dell,
Intel, and Sun.
The reason for the drive to a new high-performance I/O system was
that the existing PCI bus had become a bottleneck in the computing
process. It was hoped that updating PCI to something new would allow
the bottleneck to be removed.
Much like other standards, IB is a standard that can be implemented
by anyone. This freedom has the potential to allow for greater competition.
Today there are four major IB companies: Mellanox, Topspin (acquired
by Cisco), Silver Storm (was Infinicon), and Voltaire. However,
Mellanox is the main manufacturer of Infiniband ASICs.
The Infiniband specification that was finally ratified provides for
a number of features that improve latency and bandwidth for interconnects.
One of these is that IB is a bidirectional serial bus. This reduces
cost and can improve latency. The specification also provides for the
NICs (usually called an HCA - Host Channel Adapter) to use RDMA. This
greatly improves latency. Equally important, the specification provides
an upgrade path for faster interconnect speeds.
As with other high-speed interconnects, IB does not use IP packets.
Rather, it has its own packet definition. However, some of the IB
companies have developed an 'IP-over-IB' software stack, allowing
anything written to use IP to run over IB albeit with a performance
penalty compared to native IB.
The specification starts IB at a 1X speed which allows for an IB link
to carry 2.5 Gbps (giga-bits per second) in both directions. The next
speed is called 4X. It specifies that data can travel at 10 Gbps (however
PCI-X limits this speed to about 6.25 Gbps). The next level up is
12X which provides for a data transmission rate of 30 Gbps. There are
also standards that allow for Double Data Rate (DDR) transmissions which
transfer twice the amount of data per clock cycle, and for Quad
Data Rate (QDR) transmissions that transfer 4 times the amount of data
per clock cycle. For example, a 4X DDR NIC will transfer 20 Gbps
and a 4X QDR NIC will transmit 40 Gbps.
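The naming scheme is just a product of lane count and data-rate multiplier. Here is a small sketch using the figures quoted above; the function name is made up, and the results are raw signaling rates per direction as used in the text.

```python
LANE_GBPS = 2.5                               # one 1X SDR lane, per direction
WIDTHS = {"1X": 1, "4X": 4, "12X": 12}        # number of lanes
RATE_MULT = {"SDR": 1, "DDR": 2, "QDR": 4}    # data-rate multiplier


def ib_link_gbps(width: str, rate: str = "SDR") -> float:
    """Raw signaling rate of an IB link in Gbps, per direction."""
    return LANE_GBPS * WIDTHS[width] * RATE_MULT[rate]


if __name__ == "__main__":
    for width in WIDTHS:
        for rate in RATE_MULT:
            print(f"{width} {rate}: {ib_link_gbps(width, rate):g} Gbps")
```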
Like many other networks, IB is a switched network. That is, the HCAs
connect to switches that are used to transmit the data to the other
HCAs. A single chassis switch can be used or the switches can be
connected in some topology. Today there are a wide variety
of switches from the major IB companies.
Voltaire is a privately
held company focusing on IB for HPC in
addition to other areas in need of high-speed networking. They were
the first of the IB companies to market a large
288-port IB switch.
The switch uses 14U of rack space and provides full 4X SDR (Single Data
Rate) bandwidth to
each switch port. Alternatively, this switch can accommodate up to 96 ports
of 12X Infiniband. Voltaire also ships a small (1U) IB switch, the
9024 that provides
up to 24 ports of 4X Infiniband (DDR capable). They also have a cool
product, the
ISR 6000 that allows
Infiniband based networks, like those in clusters, to be connected
to Fibre Channel or TCP networks.
Silver Storm (previously
called Infinicon) was founded in 2000 and is privately held. Silver Storm sells an
HCA
in both PCI-X and PCI-Express form factors, IB
switches, and all of the support infrastructure for IB. They currently
sell
IB switches
as large as 288 ports in a single chassis. Silver Storm also uses their
own IB software stack, called
Quick Silver
that has a reputation for very good performance, reliability, and
ease of use.
Silver Storm specializes in multi-protocol switches that allow different
types of connections, such as 4X IB, GigE, and Fibre Channel, to all
use the same switch. They are also focusing on Virtual I/O, which
allows you to aggregate traffic from SAN, LAN, and server interconnects
into a single pipe. This allows you to take 3 different networks and
combine them into a single network connection.
Topspin, which is now
part of Cisco, is also pursuing the high-performance computing market as
well as other markets that can use the high performance IB interconnect
including Grid Computing and database servers. Topspin is producing IB
products to combine CPU communication as well as IO communication. They
are also shrinking the size of IB switches. They have a 1U switch, called
the
Topspin 90, that
has up to 12 ports of 4X Infiniband and also up to two 2 Gbps Fibre
Channel ports and six GigE ports to allow the network to be connected to
a range of other networks, such as storage networks. They also sell
PCI-X and PCI-Express
HCAs that have
two IB ports.
Mellanox was founded in 1999 and
develops and markets IB ASICs
(chips), IB HCAs, switches, and all of the software for controlling
an IB fabric. Their first product was delivered in 2001 shortly after
the IB specification came out. Mellanox does not sell directly to the
public. Rather they sell to other IB companies and to vendors such as
cluster vendors.
Currently, Mellanox has a wide range of
HCA cards.
The Infinihost III Ex
cards fit into a PCI-Express x8 slot and come in SDR (10 Gbps) and
DDR (20 Gbps) versions. Each is available in two variants: one that has memory
on the HCA, and one that does not (called MemFree) that uses the host
memory. There is also an HCA, the
Infinihost III Lx,
that is MemFree only, using host memory instead of HCA memory. It also
comes in 4X SDR (10 Gbps) and DDR (20 Gbps) versions and uses a PCI-Express
interface, either as x4 (SDR) or x8 (DDR).
Mellanox has also been
an active participant in the development of the OpenIB Alliance software
project. The OpenIB Alliance software stack has been included in the
2.6.11 kernel. They are also participating in the development of what
is called "OpenIB Gen 2," which is the next generation IB stack for
the Linux kernel.
There are a number of commercial MPI implementations that support IB.
Infinicon ships an MPI library with their hardware. Scali MPI Connect,
Critical Software MPI, Intel's MPI-2, and Verari Systems Software's
(previously MPI Software Technologies) MPI/Pro
all work with IB hardware.
There are also a number of open-source MPI implementations that now
support IB. For example, LAM-MPI, Open MPI, MPICH2-CH3, MVAPICH, and
LA-MPI all support various IB stacks.