Of Pentium D, Ethernet, and those assumptions we all make
Recently, I did some benchmarking using Intel Pentium D® processors and gigabit Ethernet. The data are pretty impressive. If I were a non-technical person, I would probably say, "Pentium D kicks ass," but you know, I like numbers and have a professional reputation to uphold. Therefore, in a professional sense I can say, "Pentium D really kicks ass." To prove my point, this article presents some of the highlights from a recent white paper I prepared for Appro International called Achieving High Performance at Low Cost: The Dual Core Commodity Cluster Advantage. For a more complete description of the tests and results (including benchmark numbers), you probably want to download the white paper.
Back In The Day
Back when clusters started stirring up trouble in the High Performance Computing (HPC) world, there were those who said things like, "there is no way commodity hardware can stand up against real iron," or "you cannot build a real supercomputer from PC parts." We all know how that turned out.
Today's cluster nodes typically have dual cores sitting in dual sockets connected by a low latency/high throughput network. In market terms, this is data center/server level hardware -- the good stuff (and expensive). At the lower end of the spectrum is the desktop hardware, which one would assume is not really up to snuff as far as HPC goes. You certainly cannot build a real supercomputer from this type of hardware; you probably have to use gigabit Ethernet, for heaven's sake! Sounds like an assumption to me. Some numbers are needed.
In the past, fellow monkey Jeff Layton and I have written about very low cost commodity computing, where $2500 could get you 14.5 GFLOPS running HPL (the Top500 benchmark). These results can easily be improved upon today, as the tests were performed in 2004. Indeed, the introduction of low cost dual core processors combined with some innovative motherboards makes the commodity proposition a very real alternative to high end server hardware. Enough talk, let's get to the results because they tell the real story.
Pentium D You Say?
For the tests, I used the recently introduced 3.2 GHz Pentium D (Presler) processor from Intel (which will eventually be replaced by the Xeon 3000 line). The Presler series is a dual-core processor manufactured using the latest 65nm process and is currently available at speeds up to 3.40 GHz. More importantly for HPC users, each Presler has 4 MB of on-chip cache, which it divides evenly between the two cores (2 MB each). These caches are fed using an 800 MHz FSB and DDR2 memory.
We used eight of these to create a 16 core cluster.
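The FSB number above implies a hard memory bandwidth ceiling that both cores must share. A quick back-of-the-envelope sketch (assuming the standard 64-bit wide front-side bus data path):

```python
# Peak theoretical bandwidth of the Presler front-side bus:
# 800 MHz effective transfer rate on a 64-bit (8-byte) wide bus.
fsb_transfers = 800e6   # effective FSB transfers per second
bus_bytes = 8           # 64-bit data path

peak_gbs = fsb_transfers * bus_bytes / 1e9
print(f"Peak FSB bandwidth: {peak_gbs:.1f} GB/s")           # 6.4 GB/s
print(f"Shared by 2 cores:  {peak_gbs / 2:.1f} GB/s each")  # if both cores stream at once
```

Of course, both cores rarely stream memory at full rate simultaneously, but the shared bus is one reason per-core cache size matters so much for HPC codes.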
As a way to introduce the results, let's look at some of the assumptions currently floating around the HPC market, but first, the standard advisory: as with all things cluster, performance depends on your application. If your application(s) do not behave like the benchmarks, then you may want to do your own testing. In my testing, I used the NAS Parallel Benchmark Suite and the GROMACS molecular dynamics package. You also may want to look at Parallel Molecular Dynamics: Gromacs by Erik Lindahl.
The cluster consisted of eight Pentium D 940 (3.2 GHz) processors (16 cores total), one per motherboard, connected with an SMC 8 port gigabit Ethernet switch. (See the Testing Methodology Sidebar at the end of the article for more information.) Based on my testing, the following assumptions may be worth checking:
Server CPUs Are Best For HPC (particularly the Opteron)?
In some cases they probably are best. If you look at SPEC numbers you will see that processors like the Pentium D hold their own against their larger siblings. While the SPEC benchmarks are an important yardstick, real application benchmarks often give another data point with which to compare processors. The GROMACS molecular dynamics package is known to push processors very hard and is therefore a good test of overall number crunching capability.
In my tests, a more expensive Opteron 270 (2 GHz) was on average 22% slower than a Pentium D 940 (3.2 GHz) when running the GROMACS single processor benchmarks.
More Sockets Are Better?
Cramming cores and CPUs onto motherboards sounds like a good idea. A dual socket motherboard can now support four cores (and soon eight cores). In some cases this is a good idea; in others I am not so sure. There is much to understand about four cores sharing memory and optimum performance. In addition, the more cores on a motherboard, the more eggs you put in one basket. A failed power supply or motherboard now takes out four cores.
The recently introduced Intel Caretta motherboard (Model S3000PT) is designed to address these issues. The Caretta supports the Intel Pentium 4/Pentium D processor (Presler), four DIMM slots (DDR2 533/667 with ECC, 2-way interleaved, unbuffered), integrated 2 port SATA 3.0 Gb/s with RAID 0 & 1, an ATI ES1000 (16 MB), and dual gigabit Ethernet LAN, all in a 5.95 inch x 13 inch form factor. Interestingly, this form factor is one half the size of an Extended ATX (12"x13") motherboard. These dimensions allow a standard rack mount ATX enclosure to hold two Caretta motherboards, allowing for higher density, less memory contention, and a lower impact of component failure.
A standard cluster node can then hold two separate Caretta motherboards each with its own memory and power supply. The Caretta is only available through integrators. Contact them, they know about it.
Gigabit Ethernet Is Too Slow?
For some applications gigabit Ethernet is too slow, particularly if you are trying to service four cores on one motherboard. Most people are not aware, however, that if properly tuned, gigabit Ethernet can be very effective for some applications.
Using Netpipe, the systems were able to achieve a maximum throughput of 111 MBytes/sec and a single byte TCP latency of 36 microseconds using one of the on-board Ethernet ports.
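These two figures can be folded into the usual first-order model of message time (time = latency + size/bandwidth), which shows why small messages see only a fraction of the peak rate. A sketch using just the numbers measured above:

```python
# First-order model of the measured gigabit Ethernet link:
# time(n) = latency + n / bandwidth
latency_s = 36e-6   # single-byte TCP latency (36 microseconds)
bandwidth = 111e6   # sustained throughput (111 MBytes/sec)

def effective_mbytes_per_sec(n_bytes):
    """Effective throughput for an n-byte message under the model."""
    t = latency_s + n_bytes / bandwidth
    return n_bytes / t / 1e6

# The half-performance message size (n_1/2 = latency * bandwidth) is
# where the link reaches 50% of peak -- about 4 KB for this network.
n_half = latency_s * bandwidth
print(f"n_1/2 = {n_half:.0f} bytes")
for n in (64, 1024, 65536):
    print(f"{n:6d} bytes -> {effective_mbytes_per_sec(n):6.1f} MBytes/sec")
```

The lesson: applications that exchange many small messages feel the 36 microsecond latency far more than the 111 MBytes/sec peak, which is why latency tuning matters as much as raw throughput.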
Gigabit Ethernet Will Not Scale?
As part of the testing, I wanted to see if gigabit Ethernet could keep up with the processors and to see the effect of the dual cores on performance. A full accounting of the numbers is in the white paper. Some of the conclusions are quite interesting:
The NAS benchmarks were run on four, eight, and 16 cores. As would be expected, some codes (Integer Sort) did not scale well over gigabit Ethernet; however, the LU benchmark running on eight processors (16 cores) delivered a speed-up of 11.7 times for a total of 6.34 GFLOPS.
For the GROMACS benchmark, 8-way scaling produced a 6.5 times speed-up (7.57 GFLOPS), and the cluster was able to achieve a 9.3 times speed-up (10.84 GFLOPS) using 16 cores.
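Speed-up numbers like these are easier to compare once converted to parallel efficiency (speed-up divided by core count). A quick sketch using the figures above:

```python
# Parallel efficiency = speed-up / number of cores, for the results above.
results = [
    ("NAS LU, 16 cores",    11.7, 16),
    ("GROMACS, 8 cores",     6.5,  8),
    ("GROMACS, 16 cores",    9.3, 16),
]
for name, speedup, cores in results:
    efficiency = speedup / cores
    print(f"{name:18s} {speedup:4.1f}x on {cores:2d} cores -> {efficiency:.0%} efficient")
```

Holding roughly 60-80% efficiency at 16 cores over plain gigabit Ethernet is the point: the "will not scale" assumption does not survive contact with the data.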
Servers Are the Price-To-Performance Leaders?
Leveraging commodity hardware with the proper cluster software can provide quite astounding price-to-performance. Again, price-to-performance should always be cast in terms of an application. For example:
If the price ratios for Pentium D 940 and Opteron 270 systems are combined with the GROMACS performance data, then the price-to-performance of the Opteron solution is almost double that of the Pentium D solution -- which means you spend almost double the money to get the same performance!
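To see how this arithmetic works, here is a sketch of the comparison. The system prices below are hypothetical placeholders (the actual figures are in the white paper); only the roughly 22% GROMACS gap comes from the tests described earlier.

```python
# Price-to-performance sketch. PRICES ARE HYPOTHETICAL PLACEHOLDERS --
# the real figures are in the white paper. Performance is normalized to
# the Pentium D, with the Opteron ~22% slower on the GROMACS benchmarks.
systems = {
    "Pentium D 940": {"price": 1500.0, "perf": 1.00},  # assumed price
    "Opteron 270":   {"price": 2400.0, "perf": 0.78},  # assumed price
}
for name, s in systems.items():
    dollars_per_perf = s["price"] / s["perf"]
    print(f"{name:14s} ${dollars_per_perf:,.0f} per unit of performance")
# With these placeholder prices, the Opteron system costs about twice
# as much per unit of performance as the Pentium D system.
```

The exact ratio depends on the street prices you plug in, which is precisely why you should run this arithmetic with your own quotes and your own application's benchmark numbers.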
If I had more time, I would have tested many more applications and worked on improving the current numbers, but the performance picture quickly became clear. If you are interested in getting more bang for your buck, then test your assumptions and take a look at all the hardware options, even the ones you dismissed in the past. Here are a few steps that can help you get started:
Consult experienced integrators, like Appro, about your applications and needs. They are qualified to take commodity technology and turn it into a real industrial strength cluster.
Consult an experienced software partner, like Basement Supercomputing; they understand how to get the most out of your cluster and will be there when it is time to upgrade or make changes.
Finally, never assume! If possible, test as many assumptions as you can and do not be afraid to rethink your position.
Clustering began in the mid 1990's with commodity-off-the-shelf (COTS) hardware. It was a good idea then and it appears to be a good idea now.
Sidebar: Testing Methodology
Tests were conducted using eight dual core Intel Pentium D (model 940) Presler servers operating at 3.2 GHz. Each server used a Nobhill motherboard (Intel Model SE7230NH1), which is functionally equivalent to the Caretta motherboard, but larger in size. Each node had 8GB of DDR2 RAM and two gigabit Ethernet ports (only one of which was used for the testing). An SMC 8508T Ethernet switch was used to connect the servers. Ethernet drivers were set to provide the best performance for a given test. In addition, where appropriate, the MPI tests were run with the "sysv" flag so that cores on the same node communicate through shared memory. Contact the author for details, including the full software environment.
Appro International: Appro is a leading developer of high-performance, density-managed servers, storage subsystems, and high-end workstations for the high-performance and enterprise computing markets.
Benchmark and Author Contact Information: Raw benchmark data are available here. Douglas Eadline, PhD can be reached at deadline ( at ) basement-supercomputing ( period ) com
GROMACS is a versatile package to perform molecular dynamics, i.e. simulate the Newtonian equations of motion for systems with hundreds to millions of particles. It is primarily designed for biochemical molecules like proteins and lipids that have a lot of complicated bonded interactions, but since GROMACS is extremely fast at calculating the nonbonded interactions (that usually dominate simulations) many groups are also using it for research on non-biological systems, e.g. polymers.
NAS Parallel Benchmark: These tests are a small set of programs designed to help evaluate the performance of parallel supercomputers. The benchmarks, which are derived from computational fluid dynamics (CFD) applications, consist of five kernels and three pseudo-applications.