Conference Reports

SC05 Wrapup - No Sleeping in Seattle

Published on Monday, 05 December 2005 19:00
Written by Jeff Layton
Hits: 5027

Introduction

The SC conference is always a lot of fun because there are so many cool new things at the show, you get to see people that you have only emailed, you get to see old friends, and you get to "geek out" without too much grief from your family. This year's show was no exception. It was the largest SC conference ever and had lots of new announcements and even included a large presence from Microsoft. Of course, Cluster Monkey has already commented about this turn of events.

This year's SC was a good show. I didn't get to any of the presentations, but I did try to see as much of the show floor as I could. So I want to share with you what I saw and what I learned talking with the various vendors. However, the show floor was huge, so I may have to rely on press releases to get me over the hump.

In addition to this summary, which is by no means a complete synopsis of the show, check out Joe Landman's blog as well.

SC05 - Location, Location, Location

This year's SC Conference was at the Washington State Convention and Trade Center in Seattle, Washington. It was a great location for the conference, although the exhibition hall had to be split into two parts to accommodate all of the vendors. However, I may be a little prejudiced since I used to live in Seattle.

The Convention Center is located just a few blocks up from Pike Place Market and it is literally surrounded by coffee places, particularly Starbucks. I swear that you can't walk 20 feet without running into a coffee place. On the other hand, I don't like Starbucks because it tastes like they've burned their beans. Anyway, there are a number of hotels in the area, plenty of places to eat, and, more importantly, lots of watering holes that serve the best of the local micro-breweries (never mind Dungeness Crab).

Monkey Get Together

The First Monkey Get Together was a huge success. A number of people showed up and all of the hats were given out (although we had to save one for rgb. So if you see someone with a yellow hat with a monkey on it on the Duke campus, introduce yourself! rgb is a great guy.). I got to see some old friends like Roger Smith, Joey, and Trey from the ERC at Mississippi State, Dan Stanzione of Cluster Monkey-famedom, Glen Otero - International Man of Mystery and Super Cluster Monkey (ladies, he's still single and still a body builder), and others. I also got to meet some new friends like Josip Loncaric. Josip was an early major contributor to clusters. He made a small change to the TCP stack in the 2.2 kernel, greatly improving TCP performance. He now works at Los Alamos on aspects of clusters and high performance computing. It was a real honor to meet him and to talk to him (a little hero worship going on there).

I also spent some time talking to Dimitri Mavriplis, who is a professor at the University of Wyoming. He is one of the best CFD (Computational Fluid Dynamics) researchers in the world. It was great fun to talk about CFD with him since that's one of my interests, as well as clusters (he uses clusters in his research). If you are looking for CFD codes for your clusters, Dr. Mavriplis is the man to talk to.

All in all, the Monkey Get Together was a big success. It was very nice to see such a groundswell of support for clusters, particularly beowulfs, from such a cross-section of the community. There were people there from competing companies, yet they could discuss the state of clusters and the future of clusters in a constructive and passionate way.

Linux Networx Announcements

I'd like to talk about the Linux Networx announcements at the conference, but I need to disclose that I work for Linux Networx so you can view this as a shameless plug if you like. However, even if I didn't work for Linux Networx, I would still write about their announcements and you'll see why in the following paragraphs.

Linux Networx introduced two new clusters: the LS-1 and the LS/X. These two systems represent a new approach to clusters - bringing them to the systems level. Doug has mentioned in one of his recent writings that clusters were a disruptive influence on HPC at many levels, one of them being "disruptive support." Doug went on to say, "There are integrated clusters from larger vendors that reduce the number of user options in order to increase the level of performance, integration, and support." This is precisely what Linux Networx has done. The key concept is to take a systems approach to clusters and make them easier to use, easier to manage, easier to support, and easier to upgrade. Both the LS-1 and the LS/X embody this philosophy.

Full-Height LS-1 and Half-Height LS-1. Courtesy of Linux Networx

LS-1

The LS-1 has been designed based on the years of experience Linux Networx has with clusters, using "best of breed" components and processes. The LS-1 is designed for the small-to-medium range of the market with up to 128 nodes. The current LS-1 system is Opteron-only, with dual-socket nodes that are dual-core capable. You can also choose to have a GigE network, a Myrinet 2G network, or an Infiniband network (Infinipath is coming around 1Q of 2006). There are also a number of storage options that range from simple NFS boxes to parallel file systems with great IO performance. At SC05 there was also a technology demo of a parallel visualization capability for the LS-1. Linux Networx is working very hard on visualization. To give you a little insider information, I think the resulting visualization product will be really neat and cost much less than the equivalent SGI visualization equipment (not that I'm biased or anything).

LS/X

The LS/X is designed for the upper range of supercomputer performance. It uses a mid-plane architecture where the boards slide into an 8U sub-rack (I guess you can call them blades). Linux Networx is currently shipping a 4-socket Opteron node (dual-core capable) with two built-in Infinipath NICs, two GigE NICs, and up to 64 GB of memory. For each 4-socket node there are also two bays at the rear of the rack that allow either two SATA drives or two PCI-Express cards to be connected to the node. Linux Networx is also doing some 8-socket boards for special situations, but they may or may not be generally available. However, at SC05, Linux Networx was showing an 8-socket Opteron node (dual-core capable) with 4 Infinipath NICs, 4 GigE NICs, up to 128 GB of memory, and up to four SATA drives or four PCI-Express cards per node. Up to 6 of the 4-socket nodes can be put into an 8U sub-rack and up to 4 sub-racks in a normal rack, for a total of up to 96 sockets in a single rack.

Three racks of LS/X. Courtesy of Linux Networx

The LS/X nodes slide into a mid-plane to get their power (from a DC PDU in the bottom of the rack), communication, and expandability. The sub-racks have built-in Tier-1 switching for the Infinipath and GigE networks. The racks can also have Tier-2 switching in the bottom of the rack. These built-in switches greatly reduce the number of required cables. For a full rack you only need 17 cables! A very high percentage of the parts of the nodes are field replaceable (you just pull them out and put in a new one). The racks are also designed to sit over vented tiles in a raised floor area to pull air up into the rack. This eliminates hot air recirculation. The performance of the LS/X is setting records on benchmarks, which should be posted on the website soon. It is very competitive with the IBM Blue Gene, Power 5, Cray X1, and Cray XD1 on the HPC Challenge benchmark. In some cases it has the best performance of any of these systems.

The Intel booth was right next to the Linux Networx booth so I did want to mention that an Intel person, who watched the unveiling of the LS-1 and the LS/X on Monday night, commented that they thought the systems were the "...sexiest machines on the floor..." despite not having Intel chips in them.

Pathscale and Infinipath

I spent some time talking to the Pathscale folks. They are great people to talk to since they know so much and they are so enthusiastic about clusters. Greg Lindahl took some time to demonstrate how to use their compilers to search for the best set of compile flags for performance for a given code. Very cool feature. However, what was even more interesting was that they like to hear what compiler flags people end up using for what codes. Greg said this helps them understand how to improve their compiler. Part of the improvements come from knowing how to better optimize code and part comes from knowing what options are routinely used and how to improve them. He had some very interesting comments about what compile options work well for certain codes.

Even more exciting than their compilers is their Infinipath interconnect. They announced this new interconnect a while ago, but it is now shipping in quantity. Let me tell you, this interconnect is really hot stuff. Pathscale has taken a great deal of care to understand how various parameters affect code performance. While things such as zero-byte packet latency and peak bandwidth are important in some respects, Pathscale has realized that things such as N/2 and message rate are perhaps more important. N/2 is the message size at which the interconnect reaches half of its peak unidirectional bandwidth. You want the smallest N/2 possible for the best code performance, and Infinipath has it. In addition, you want the fastest message rate possible out of the NIC for the best performance (seems obvious, but I had never thought about it before). Pathscale took this into account when designing their NIC. They have the best message rate of any interconnect that I know of. In addition, the performance of the NIC gets better as you add cores. Imagine that!
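To make the N/2 idea concrete, here's a toy latency/bandwidth model (my own illustration with made-up numbers, not Pathscale's actual specs): if sending an n-byte message takes latency plus n divided by peak bandwidth, then effective bandwidth hits half of peak exactly when n equals latency times peak bandwidth - that's the N/2 size.

```python
# Toy model: time to send an n-byte message is T(n) = latency + n / peak_bw,
# so effective bandwidth is b(n) = n / T(n).
# b(n) reaches half of peak_bw exactly when n = latency * peak_bw (the N/2 size).

def effective_bw(n_bytes, latency_s, peak_bw_Bps):
    """Effective bandwidth (bytes/s) for an n-byte message under the linear model."""
    return n_bytes / (latency_s + n_bytes / peak_bw_Bps)

# Hypothetical interconnect: 1.5 microsecond latency, 900 MB/s peak.
latency = 1.5e-6
peak = 900e6

n_half = latency * peak  # N/2 under this model
print(f"N/2 = {n_half:.0f} bytes")                                     # 1350 bytes
print(f"b(N/2) = {effective_bw(n_half, latency, peak)/1e6:.0f} MB/s")  # 450 MB/s, half of peak
```

The takeaway is why a small N/2 matters: lower latency shrinks N/2, so real codes that send small and medium-sized messages see a bigger fraction of the wire's peak bandwidth.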

Pathscale has a number of papers on their website that discuss Infinipath and the influence of network performance on code performance and scalability. You can download the papers from their website. They are very useful and informative.

Since I work for Linux Networx and we are using the Infinipath ASIC in our new LS/X system, I can safely say that the benchmarks I've seen using the Infinipath NIC are amazing. We should be posting benchmarks in the near future, but I can safely say that the results will stun people. Very, Very fast.

10 GigE

I've been watching 10 Gigabit Ethernet (GigE) for over a year, since companies started to talk about 10 GigE NICs (Network Interface Cards). At last year's SC Conference, Neterion (formerly S2IO) and Chelsio were showing 10 GigE NICs, primarily using fiber optic connections. They were expensive, but so was GigE a few years ago. However, the really large problem was the cost of 10 GigE switches. So I walked around the floor at SC05 talking to various Ethernet switch companies as well as the 10 GigE NIC vendors.

The general consensus was that prices for 10 GigE NICs are coming down quickly and will continue to do so. Plus, copper 10 GigE NICs are common now. But, perhaps more importantly, 10 GigE switch prices are coming down too. Some of the drop is coming from the traditional HPC Ethernet companies such as Foundry, Force10, and Extreme Networks. However, the biggest price drops are coming, perhaps unexpectedly, from companies that either haven't traditionally played in the HPC space, are new companies, or are new to Ethernet.

Chelsio

Chelsio was showing their 10 GigE NICs at SC05. They have the lowest list-priced 10 GigE NICs I've seen. Their T210-CX 10 GigE NIC has a copper connection while the T210 NIC has a fiber connection. Both are PCI-X NICs (maybe if we ask hard enough they will do a PCI-Express version). They both have RDMA support as well as TOE (TCP Off-load Engine). Chelsio also has a "dumb" NIC that does not have RDMA or TOE support and uses fiber connectors (N210). Chelsio is also using their 10 GigE technology for the rapidly expanding iSCSI market. At SC05 they announced a PCI-Express-based 10 GigE NIC with 4 ports, TOE, and iSCSI hardware acceleration.

10 GigE Switches

I didn't get to talk to the primary 10 GigE switch companies - Foundry, Force10 or Extreme, so I'm going to have to rely on their websites and press releases. Foundry currently has a range of switches that can accommodate 10 GigE line cards. Their high end switch, the BigIron RX-16, can accommodate up to 64 10 GigE ports in a single chassis. At the lower end, their SuperX series of switches can accommodate up to 16 ports of 10 GigE.


Force10 has the largest 10 GigE port count in a single chassis that I know of. On Oct. 31 they announced new line cards for their Terascale E-Series switches that allow them to go to 224 ports of 10 GigE in a single switch (14 line cards with 16 10 GigE ports per line card). At that size they also said the price per port would be about $3,600. By the way, in the same switch chassis you can also put 1,260 GigE ports.

Extreme Networks was also at SC05. They have a large switch, the BlackDiamond 10808 that allows up to 48 ports of 10 GigE. They are also working with Myricom to use their 10 GigE switches with Myricom's new 10G interconnect NICs.

While not necessarily new, there were some companies showing small port count 10 GigE switches with the lowest per port cost available. Fujitsu was proudly displaying their 12 port, 10 GigE switch. It is one of the fastest 10 GigE switches available with a very low per port cost of approximately $1,200.

Already companies are taking advantage of the Fujitsu 12-port 10 GigE ASIC. One of the traditional HPC interconnect companies, Quadrics, is branching out into the 10 GigE market. At SC05, they were showing a new 10 GigE switch that uses the Fujitsu ASIC. The switch is an 8U chassis that has 12 slots for 10 GigE line cards. Each line card has eight 10 GigE ports that connect using CX4 connectors (they look like the new "thin" Infiniband cables). This means that the switch can have up to a total of 96 ports of 10 GigE. The remaining four ports on each line card are used internally to connect the line cards in a fat-tree configuration. This means that the network is 2:1 oversubscribed but looks to have very good performance. This will be one of the largest single-chassis 10 GigE switches on the market (that I know of) when it comes out in Q1 2006. No prices have been announced, but I've heard rumors that the price should be below $2,000 a port. Quadrics also stated in their press release that they will have follow-on products that increase the port count to 160 and then 1,600.
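The port arithmetic for that chassis is worth checking. A quick sketch (numbers taken from the description above; this is just my own math, not vendor data):

```python
# Port math for the Quadrics 10 GigE chassis described above.
line_cards = 12
external_ports_per_card = 8   # CX4 ports facing the users
internal_ports_per_card = 4   # links into the internal fat tree

total_external = line_cards * external_ports_per_card
oversubscription = external_ports_per_card / internal_ports_per_card

print(total_external)      # 96 external 10 GigE ports
print(oversubscription)    # 2.0 -> the 2:1 oversubscription mentioned above
```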

I also spoke with a new company, Fulcrum Microsystems, about a new 10 GigE switch ASIC they are developing. It has great performance (about 200 nanoseconds of latency) with up to 24 ports, and it uses cut-through rather than store-and-forward switching to help performance. The ASIC will be available in Jan. 2006 for about $20/port. A number of vendors are looking at them for making HPC-centric 10 GigE switches. They have a nice paper that talks about how to take the 24-port 10 GigE switches, built using their ASICs of course, and construct a 288-port fat-tree topology with full bandwidth to each port. The fat tree would only have a latency of about 400 nanoseconds (two tiers of switches). Maybe the ASICs from Fulcrum Microsystems will get 10 GigE over the price hump and put it on par with other high-speed interconnects.
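The 288-port figure falls out of standard folded-Clos (fat-tree) arithmetic. Here's my own sketch of how 24-port switches get there, assuming the usual two-tier design with half of each leaf's ports facing hosts (an assumption on my part, not a detail from the Fulcrum paper):

```python
# Two-tier folded-Clos port math for 24-port switch ASICs.
radix = 24
down_per_leaf = radix // 2           # 12 host-facing ports per leaf switch
up_per_leaf = radix - down_per_leaf  # 12 uplinks -> full bisection bandwidth

spines = up_per_leaf   # 12 spine switches, one uplink from each leaf to each spine
leaves = radix         # each 24-port spine can reach 24 distinct leaves

total_host_ports = leaves * down_per_leaf
print(total_host_ports)  # 288 full-bandwidth ports, matching the paper's figure
```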


Personal Clusters

Over the last couple of years, various companies have been discussing what some people call "personal clusters." There are a number of reasons for this new genre of clusters.

Supercomputers of the past were large central systems that were shared by users. These systems grew in size and capability, but the needs of the users and the number of users far outstripped the growth. Consequently, the amount of processing time for each user steadily decreased and the overall effective performance of the system for each user decreased. Beowulfs were partially developed as a way to give more power to the individual. The original Beowulf that Tom Sterling and Don Becker developed was designed to be used by one person or a small group of people (basically a workstation). The combination of high performance commodity components, an open-source operating system (Linux), and the availability of parallel libraries and codes allowed for low-cost, high-performance systems to be built. This was the genesis of beowulfs.

As clusters, particularly beowulfs, became popular they started to replace the traditional central supercomputers. However, the same concept of having a large central shared resource still rules the day. So now we have large centralized clusters for users. So, in my humble opinion, we are falling into the exact same trap that doomed the old supercomputers - reducing the amount of computer power available to a user or a small team of users.

So, how do we prevent clusters from going down the same hole that doomed traditional supercomputers? I'm sure there are several solutions, but one that I see is the development of personal clusters.

At SC05, Bill Gates gave the keynote address, where he talked about Microsoft entering the HPC arena and also about the personal cluster for individuals or small teams. While I assume that Bill didn't steal the idea, he should have talked to me before the speech anyway. However, it is kind of disconcerting to have Microsoft saying the same things as you. Anyway, it looks like Microsoft as well as IDC think that small to medium-size clusters will be on the rise in the next few years.

Front view of Tyan Personal Cluster

There have been a number of personal clusters developed. Orion Multisystems was one of the first to announce a true personal cluster - that is, a cluster that runs in a normal cubicle environment with a single power cord and a single power switch. Other companies have shown scaled-down versions of clusters using shorter racks. Rocketcalc has been shipping personal clusters for a few years and has multiple models available. One of them, the Delta, uses an 8-socket Opteron motherboard. Other companies have also taken this approach, which gives the user a large SMP machine. However, at SC05, there were a couple of new systems that are notable.

Tyan

Tyan, the motherboard manufacturer, is working on a new personal cluster: a small chassis with four dual-socket Opteron boards in it. The chassis is about 12" x 12" as you look at the front and about 24" deep. There are a number of large fans in the back to cool all four motherboards; using the large fans helps reduce the noise of the cluster. They use the HE version of the Opteron to reduce power and cooling. Each board can accommodate up to 16 GB of memory. At the current time they include a GigE network to connect the four nodes and to connect the personal cluster to the outside world. It can also have up to 1 TB of storage in the chassis.

Penguin Computing

Penguin Computing was showing a personal cluster that, in my humble opinion, is second to none in terms of power, form factor, and ease of use.

Michael Will of Penguin showing a dual socket node

Penguin Computing has a 4U rack-mount chassis called Blade Runner that can accommodate up to 12 blades with either dual Xeon EM64T processors or dual Opteron HE processors. Each blade can have up to 8 GB of memory and has two built-in GigE NICs. In the picture, Michael Will of Penguin Computing is holding a dual Xeon blade from one of the 4U Blade Runner chassis.

By the way, Michael contributes regularly to the Beowulf mailing list. He is very experienced with clusters, and helps people regardless of whether they have Penguin hardware or not.

This 4U chassis has a built-in GigE switch with an optional second GigE switch. It also has redundant 3+1 power supplies and a built-in KVM capability (presumably via IPMI). Using these 4U boxes, they can get up to 10 chassis, or 120 blades, or 240 processors, or 480 cores (if using dual-core Opterons) per 42U rack. That's density.
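The density claim is easy to verify with a little arithmetic (my own back-of-the-envelope check of the numbers in the text, assuming 10 of the 4U chassis fit in a 42U rack as stated):

```python
# Density check for the Blade Runner numbers above.
blades_per_chassis = 12
chassis_per_rack = 10    # 10 x 4U = 40U of a 42U rack
sockets_per_blade = 2
cores_per_socket = 2     # dual-core Opterons

blades = blades_per_chassis * chassis_per_rack
processors = blades * sockets_per_blade
cores = processors * cores_per_socket

print(blades)      # 120 blades
print(processors)  # 240 processors
print(cores)       # 480 cores per rack
```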

While the Blade Runner chassis are for normal rack mount clusters, Penguin has done something very, very clever. They have taken the concept behind the Blade Runner, created a vertical chassis that uses the exact same blades, put wheels on it, and made this single Penguin Personal Cluster unit into THE BEST, bar none, desktop cluster I've seen.

Side View of the Penguin Computing Personal Cluster

In the Penguin Personal Cluster they can get up to 12 blades, or 24 processors, or 48 cores (if using dual-core Opterons) in a single chassis. With up to 8 GB of memory per blade, that's up to 96 GB of total memory in the unit. Each blade can also hold at least one 2.5" hard drive (the Xeon blades can hold up to two 2.5" hard drives). To minimize power and cooling, they use low-voltage Xeon processors or Opteron HE processors. The chassis has a built-in GigE switch with 20 ports (12 internal and 8 external). There is an optional second GigE switch so that you can use channel bonding for each blade. They have also incorporated the peripherals a workstation requires (e.g., DVD, USB).

The Application-Ready Personal Cluster from Penguin Computing has the best balance of number of processors and individual CPU performance I've seen. At SC05, Penguin had a 12 blade, dual processor, dual-core Opteron system running Fluent in the AMD booth (total of 48 cores). It was quiet enough that I could stand next to it and talk to someone without having to raise my voice (despite the fact that I was losing my voice). Uber-cool and a very, very useful personal cluster. Nice work Penguin!

Myricom 10G

Myricom will have their new 10G product shipping any day now. It is an interesting product because the NICs can be plugged into a 10 GigE switch and they will behave like normal 10 GigE TCP NICs and speak TCP. They can also be plugged into Myricom's switches and they behave like Myrinet NICs (running MX). Pretty interesting idea. However until the price of 10 GigE switches comes down, using the TCP capability of 10G is really only good for an uplink. But, as I said earlier, the price is coming down. In the meantime, the Myricom switches will give you good performance with 10G NICs.

Front View of the Penguin Computing Personal Cluster

Clearspeed

Clearspeed is a company that has been developing an array processor ASIC for a year or so. The goal is to accelerate floating-point computations using very little power. The array processor chip has 96 processing units, with each unit having 6 KB of SRAM. There is also 128 KB of scratchpad memory for the chip and 576 KB of on-chip memory. Clearspeed will be shipping a PCI-X card that has two of these chips. Each card can also have up to 8 GB of DDR2 memory. The card communicates with the main processors and memory over the PCI-X bus with a resulting bandwidth of 3.2 GB/s. They are working on BLAS and FFTW libraries for the card, so all you have to do is link to their library and the resulting code can run on the card. They also have an API for writing your own code.

At SC05, Clearspeed was demoing the card running a simple DGEMM (double-precision matrix multiply) computation. However, the performance was anything but simple. The card was getting about 30 GFLOPS of sustained performance! (A fast dual Opteron gets about 8 GFLOPS.) Also, the card was only using about 25 watts of power (a standard Opteron has a thermal envelope of 90 watts).
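It's the performance-per-watt ratio that really jumps out. A rough comparison, using only the figures above and assuming the dual-Opteron number is charged only for its two 90-watt CPUs (my assumption; it ignores memory and the rest of the system, so treat this as a ballpark):

```python
# Rough GFLOPS-per-watt comparison for the DGEMM numbers above.
card_gflops, card_watts = 30.0, 25.0
opteron_gflops, opteron_watts = 8.0, 2 * 90.0  # assumption: two 90 W CPUs only

card_eff = card_gflops / card_watts       # 1.2 GFLOPS/W
cpu_eff = opteron_gflops / opteron_watts  # ~0.044 GFLOPS/W

print(f"{card_eff / cpu_eff:.0f}x better GFLOPS per watt")  # roughly 27x
```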

AMD Booth

AMD usually has a unique booth at shows. Rather than just talk about AMD processors, they invite partnering companies to show in the booth with them. This helps smaller companies who can't afford a booth and gets them more exposure, and it helps AMD by showing how many partners they have and what unique and interesting things those partners are doing. SC05 was no exception. AMD had a number of companies in their booth. They had a neat pillar with a bunch of motherboards from various companies. Some of the boards had HTX (HyperTransport eXtension) slots, and some had dual or quad sockets. One also had lots of DIMM slots per socket (up to 8 in one case), which screams lots of memory. The thing that impressed me the most about the motherboards was the variety and the innovation that companies are showing. I also think it's safe to say that you will see a number of HTX motherboards coming out in Q1 2006. Now if I could find a board company that has a Micro-ATX board with built-in video, GigE, HTX, and PCI-Express, then I would be very happy. I guess you know what's on my Christmas list :)


PSSC Labs was showing a liquid cooled 1U server node that was very, very quiet. The key to the reduced noise is that they are using small fans that run at about 6,000 rpm (the usual 1U fans run at about 12,000 to 13,000 rpm). They didn't know much about the liquid since the cooling system is made by another company, but they did know that it is not conductive. The cooling system was very interesting because in a small vertical space like a 1U they had a radiator with three or four of the small, low-speed fans, a pump, and a reservoir. They had the cooling attachments for a single or dual socket system. This approach might even give a quad-socket machine a chance to be put into a 1U without violating OSHA noise standards. The gentlemen at PSSC Labs said that the 1U liquid cooled servers should be out later in 2006.

Verari had one of their blades on display in the AMD booth as well. It's a very nice blade that takes a COTS motherboard and turns it vertically along with a hard drive and power supply. Their rack then takes the blades and connects them to power and communication. Verari also introduced their new BladeRack 2 product to handle the increased power requirements from new processors.

Onward and Upward

I hope everyone enjoyed the show in Seattle. If you couldn't make it, then perhaps you can use these comments to convince your boss you should go next year. Next year's show is in Tampa Bay (ahh, the South). The next two shows after that are in Austin (great BBQ - I recommend the "Green Mesquite") and Reno (think smoky, stinky hotel rooms - yuck).


Jeff Layton has been a cluster enthusiast since 1997 and spends far too much time reading mailing lists. He occasionally finds time to perform experiments on clusters in his basement. He also has a Ph.D. in Aeronautical and Astronautical Engineering and he's not afraid to use it.