"Clusters of Bare Motherboards" -- Jeff's new rock band

The Beowulf mailing list provides detailed discussions about issues concerning Linux HPC clusters. In this article I review some postings to the Beowulf list on clusters of bare motherboards and choosing a high-speed interconnect.

Beowulf of Bare Motherboards?

The experimental, entrepreneurial spirit for clusters is alive and well! On Sept. 27, 2004, Andrew Pisorski asked about mounting bare motherboards directly on metal racks in some fashion. Even though Andrew didn't mention it, the goals of projects such as these are to reduce costs (no cases) and to increase density for little or no cost (blade servers from vendors are fairly pricey). Jack Wathey responded that in late 2003 he built a cluster of bare motherboards mounted on aluminum sheets, which were in turn attached to metal shelves. He attached the motherboards to the sheets using nylon standoffs. He also said that Ken Chase had done something similar by packing the motherboards close together, with no shielding between them, in a Plexiglas case. Ken had some problems because of radio frequency interference between the boards, and some would not boot. He was also worried about the risk of fire with so much wood and plastic near high-power components.

Glenn Gardner responded that he had built a cluster in his apartment using mini-itx motherboards. He said that the project was fun and that he learned some lessons. First, drilling all of those mounting holes was a ton of work. Second, thermal issues are likely to be significant, and you will need some fans to provide adequate cooling. Third, RFI (radio frequency interference) can be a big issue, so you should prepare for it. Fourth, power distribution might be a problem (watch the startup power requirements; staggering the nodes during boot-up helps). Fifth, mechanical integrity is important, and you should plan accordingly. He also pointed out that if you need to replace a motherboard you could end up dismantling the entire cluster!

Alvin Oga jumped in to post some comments on a mini-itx project he was working on. He mentioned that he's working on a 4U blade system where each blade has two mini-itx boards (10 CPUs per 4U).

Jim Lux then posted on this topic. He first commented on having to drill hundreds of holes for mounting motherboards and recommended contracting this out. He also said that one advantage of dense packing is that you can use a few large-diameter fans, which move large amounts of air fairly quietly (fan efficiency goes up with diameter). Jim also had some very good discussion about shielding and pointed out that good-sized case holes won't hurt the EMI performance and can allow a wireless signal to make it into the system.

Florent Calvayrac posted a link to a small Beowulf with which he was involved, an 8-CPU Athlon system with the motherboards mounted inside a sheet-metal box. Florent stated that the design took about 20 hours and the fabrication about a week. A really interesting detail is that some graduate students did a thermal/cooling analysis of the system and predicted the temperatures to within 1 deg. C.

Andrew Pisorski responded that he appreciated everyone's comments and then asked about using one power supply for several motherboards. He also wondered how he could stagger the boot sequence of the motherboards on a single power supply to reduce the peak load. Jim Lux responded with some wonderful insight into power usage at startup and pointed out that the biggest draw of power at startup is the hard drive spinning up. Since this is a big peak-power issue, going with diskless nodes has a definite advantage.

Choosing a High-speed Interconnect

[Note: Since this discussion, we have posted our Cluster Interconnects: The Whole Shebang review. You may find this article helpful in addition to the comments below.]

Everyone loves speed. We're all speed junkies at heart. On October 12, 2004, Chris Sideroff asked about selecting a high-speed cluster interconnect. The group he works with has a 30-node dual-Opteron cluster with GigE (Gigabit Ethernet) and wanted to upgrade to something like Myrinet, Quadrics, or InfiniBand (IB). Later Chris mentioned that they were running Computational Fluid Dynamics (CFD) codes, including Fluent.

The cluster expert Joe Landman posted some very good questions that everyone should consider before upgrading to a high-speed interconnect. In essence, Joe was suggesting profiling your application(s) to locate the bottlenecks. Once this is done, you need to decide which aspect of "fast" in a fast interconnect you actually need - low latency, high bandwidth, or both. Joe then noted that all of the high-speed interconnects have HBAs (Host Bus Adapters - i.e., a "NIC") that range in price from $500 to $2,000, and switch prices run about $1,000 a port (so anywhere from $1,500 to $3,000 per port in total).
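Joe's figures make it easy to put a rough price tag on an upgrade. The sketch below just runs the arithmetic on the ranges he quoted (HBAs at $500-$2,000, switch ports at roughly $1,000 each) for a 30-node cluster like Chris's; the prices are 2004 list-discussion figures, not current ones.

```python
# Rough cost model for a high-speed interconnect upgrade, using the
# price ranges quoted on the list: HBAs at $500-$2,000 per node and
# switch ports at about $1,000 each. NODES matches the 30-node
# cluster in the discussion.

NODES = 30
HBA_LOW, HBA_HIGH = 500, 2000   # host adapter price range (USD)
SWITCH_PER_PORT = 1000          # approximate switch cost per port (USD)

low_total = NODES * (HBA_LOW + SWITCH_PER_PORT)
high_total = NODES * (HBA_HIGH + SWITCH_PER_PORT)

print(f"Per-port cost: ${HBA_LOW + SWITCH_PER_PORT:,} to ${HBA_HIGH + SWITCH_PER_PORT:,}")
print(f"{NODES}-node cluster total: ${low_total:,} to ${high_total:,}")
```

At those prices the interconnect alone runs $45,000 to $90,000 for the cluster, which is why the list regulars keep insisting you verify the need before buying.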

Michael Prinkey suggested looking at the GAMMA project, which is similar in spirit to the M-VIA (Virtual Interface Architecture) implementation over GigE. The well-respected Mikhail Kuzminsky said that they had trouble with GAMMA on SMP kernels and with the Intel Pro/1000 NICs. Moreover, they found that Intel frequently modified the chipset, causing compatibility problems. [Note: Some of these problems have been fixed in newer versions of GAMMA. In addition, the M-VIA project seems to have been discontinued.]

List regular Robert Brown then posted some further suggestions about determining if the application(s) needed a high-speed interconnect. One thing he suggested is that if the application passes large messages then it is likely to be bandwidth limited. If it passes a bunch of small messages, then it is likely to be latency limited. Robert also suggested talking to the various high-speed interconnect vendors to see if they had a 'loaner' cluster that could be used for testing.
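Robert's rule of thumb can be made concrete with a simple first-order model: the time to move a message is roughly latency plus size divided by bandwidth, so the crossover size where the two terms contribute equally is latency times bandwidth. The sketch below uses illustrative round numbers (5 μs latency, 800 MB/s bandwidth), not figures from the discussion.

```python
# First-order model of message transfer time: t(m) = latency + m/bandwidth.
# Messages much smaller than the crossover size m* = latency * bandwidth
# are dominated by latency; much larger ones by bandwidth. The link
# numbers below are illustrative, not measurements from the list.

def transfer_time(size_bytes, latency_s, bandwidth_bps):
    """Time (seconds) to move one message over the link."""
    return latency_s + size_bytes / bandwidth_bps

latency = 5e-6        # 5 microseconds (illustrative)
bandwidth = 800e6     # 800 MB/s (illustrative)

crossover = latency * bandwidth   # bytes where both terms are equal
print(f"Crossover message size: {crossover:.0f} bytes")

for size in (64, 4096, 1_000_000):
    t = transfer_time(size, latency, bandwidth)
    print(f"{size:>9} B: {t * 1e6:8.2f} us  "
          f"({latency / t:.0%} of the time is latency)")
```

With these numbers the crossover is 4,000 bytes: a code passing mostly 64-byte messages would barely notice a bandwidth upgrade, while one streaming megabyte messages would barely notice a latency improvement.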

Chris Samuel posted that Fluent is very latency sensitive and that Fluent is likely to support Myrinet on Opteron CPUs.

The experienced Mark Hahn posted that the last clusters he bought with Myrinet and IB had about the same latency and cost. However, he hadn't seen any users who were bandwidth limited, which means IB's superior bandwidth is not important. He did point out that Myrinet can use dual-rail cards, which give Myrinet the same bandwidth as IB but raise the cost above IB's. Mark also thought that Quadrics, while more expensive than Myrinet and IB, had lower latency and about the same bandwidth as IB. Mark admitted that he was a bit skeptical about IB claims of better performance and lower cost, but also admitted he didn't have any IB experience. He also echoed the comments of others that you need to be sure that you need a high-speed interconnect by testing your application(s).

Matt Leininger took issue with some of Mark's skepticism about IB being field tested. He stated that where he works they have been running IB in production on clusters of 128 nodes and up. He also mentioned that a large cluster at RIKEN in Japan had been running IB stably for over six months. Matt also mentioned that IB has several vendors to choose from (he named four of them). He lastly pointed out that IB has much more field time than the latest Myrinet offering. [Note: This was 2004. Things are much different today, with a large number of IB clusters in production, and Myricom has a new interconnect - Myri-10G.]

Joe Landman responded that AMD has a Developer Center with at least one cluster with IB; it might be worthwhile to get an account and test on that machine. Joe also observed that he thought IB was drawing widespread support and is not single-sourced.

Mark Hahn responded to Matt's posting and thanked him for the information on IB clusters in production. Mark also asked whether the various IB vendors simply used Mellanox chips with some minor modifications, which would effectively make IB single-sourced after all. However, Mark was still a bit skeptical about IB.

Daniel Kidger from Quadrics then posted some more detailed information about their offerings. He said that their QsNetII interconnect sells for about $1,700 a node ($999 for the card, with the rest going to switches and cables). He also wrote that IB has about the same bandwidth but twice the latency (3.0 μsec vs. 1.5 μsec). He also said that Myrinet was lagging behind but had a new product coming out. Daniel then went on to echo the comments of others that you need to profile your application(s) and test on the various high-speed interconnects.

Bill Broadley also echoed the comments of everyone about testing one's code (have you noticed all of the luminaries in the Beowulf community are saying the same thing - test your application first?). Bill also had a good suggestion about forcing the GigE NICs to only go at Fast Ethernet speeds to see the effects on the code performance (a quick and dirty way to test the effects of network performance).
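On most Linux drivers, Bill's quick-and-dirty trick can be done with ethtool. The interface name eth0 below is an assumption (substitute your own), the commands need root, and not every NIC driver supports forcing the link speed.

```shell
# Force a GigE NIC down to Fast Ethernet (100 Mb/s) to see how
# sensitive the application is to network performance.
# "eth0" is an assumption -- substitute your interface name.
# Requires root and a driver that honors ethtool speed settings.
ethtool -s eth0 speed 100 duplex full autoneg off

# ... run the application benchmark at the reduced speed ...

# Restore autonegotiation (the link returns to 1000 Mb/s):
ethtool -s eth0 autoneg on
```

If the application runs nearly as fast at 100 Mb/s as at 1000 Mb/s, it is unlikely that a faster interconnect will help; a large slowdown suggests the network really is a bottleneck.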

Finally, Joe Landman suggested using "atop" to help profile your application. He also mentioned that if you see a lot of time being spent in a process called 'do_writ', then the code could also be I/O bound, which opens up a whole new can of worms.

This article was originally published in ClusterWorld Magazine. It has been updated and formatted for the web. If you want to read more about HPC clusters and Linux you may wish to visit Linux Magazine.

Jeff Layton has been a cluster enthusiast since 1997 and spends far too much time reading mailing lists. He can be found hanging around the Monkey Tree at (don't stick your arms through the bars though).