MPI Benchmarks on Intel Multi-core

Last fall, I had an opportunity to test the new quad-core Intel® Xeon® 5400 (Harpertown) processor. As the Harpertown was in short supply, I only had two nodes (two sockets on each node, for a total of 16 cores) with which to run my tests. I'm not picky, however; I like to run HPC tests because this type of data is absent from the mainstream press. In particular, I am most interested in how multi-core and MPI play in the HPC space. There is plenty to discuss, but for some interesting benchmark numbers, read on.

The hardware was provided by Appro as part of a white paper I was writing for SC07 (SuperComputing 2007). After some delays, like converting my temporary basement office to a real office with walls, I have finally managed to publish some of the more interesting findings from my tests. But first, a little background may be useful.

In 2006, I wrote a similar white paper using the Clovertown processor. Again, the paper was for the annual SC show (SuperComputing 2006). When I originally ran tests on the Clovertown platform, I was not that impressed. Some codes did well, but with others, all the extra cores did not make that much difference (i.e., programs running across 8 cores on a single node only saw the equivalent of 2-3 effective cores). When I received the Harpertown systems, I was not sure what to expect.

My main goal for the tests was to provide performance data for MPI applications. After all, that is what HPC users really care about. Officially, I was using a Xeon E5410 (Harpertown) at 2.33 GHz. In my previous paper, I had used a higher-clocked Xeon X5355 (Clovertown) at 2.66 GHz. Comparing the new and old processors was a secondary goal for the tests (if you consider a processor that has been on the market one year as old). In addition, I was interested in determining some MPI multi-core guidelines for HPC users. I should note that my goal was not to optimize a specific benchmark.

The hardware consisted of the following: two Appro HyperServer nodes connected by Mellanox® InfiniBand HCAs (model MHEA28-XTC). Each node has a two-socket Intel motherboard (S5000PALR), 8 GBytes of FBDIMM RAM, and a single SATA hard drive. Each motherboard also holds two 2.33 GHz quad-core Xeons (E5410) for a total of eight cores per node.

The software environment is based on Fedora Core 6 with Linux Kernel 2.6.22-12. The Fedora distribution was enhanced with the Basement Supercomputing Baseline Cluster Suite. The benchmarks were compiled with GNU gcc/gfortran version 4.2.1. Open MPI was chosen because it has support for Mellanox InfiniBand and because it supports a level of processor affinity not available in other open source MPI packages.
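For readers unfamiliar with the term, processor affinity means restricting a process to a specific core so the scheduler does not migrate it. The following is a minimal sketch of the general idea using Python's `os.sched_setaffinity`, which is available on Linux only. It is an illustration of the concept, not how Open MPI implements or configures affinity, and the helper name `pin_to_one_core` is my own.

```python
import os


def pin_to_one_core():
    """Pin the calling process to a single allowed core and return its id.

    Returns None on platforms that lack os.sched_setaffinity
    (it is a Linux-specific call).
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    # Cores the scheduler currently allows this process to run on.
    allowed = sorted(os.sched_getaffinity(0))
    core = allowed[0]
    # A pid of 0 means "the calling process".
    os.sched_setaffinity(0, {core})
    return core


if __name__ == "__main__":
    print("pinned to core:", pin_to_one_core())
```

In an MPI job, the launcher would typically do this for each rank so that every process stays on its own core for the life of the run.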

The NAS Parallel Benchmarks (NPB version 2.3) suite was used as a good overall test of system performance. The NAS tests are a small set of programs designed to help evaluate the performance of parallel supercomputers. The benchmarks, which are derived from computational fluid dynamics (CFD) applications, consist of five kernels and three pseudo-applications. In addition to being widely used, the NAS suite is self-checking and exhibits a range of program behaviors. All codes were run three times and the results were averaged. Also note that results are measured in MFLOPS (Million Floating Point Operations Per Second), except for IS, which should be measured in MIOPS (Million Integer Operations Per Second).
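The averaging step is simple, but for clarity here is a minimal sketch of it. The per-run rates below are hypothetical placeholders, not measured data, and the function name `average_runs` is my own.

```python
from statistics import mean


def average_runs(runs):
    """Return the mean rate per benchmark, given {name: [rate, rate, ...]}."""
    return {name: mean(rates) for name, rates in runs.items()}


# HYPOTHETICAL per-run rates, three runs each (NOT measured results).
hypothetical_runs = {
    "BT": [100.0, 110.0, 120.0],
    "IS": [10.0, 11.0, 12.0],
}

if __name__ == "__main__":
    for name, avg in average_runs(hypothetical_runs).items():
        # IS is the integer benchmark, so its unit differs from the rest.
        unit = "MIOPS" if name == "IS" else "MFLOPS"
        print(f"{name}: {avg:.1f} {unit}")
```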

The Numbers

As mentioned, in a previous white paper I had measured the performance of the quad-core Xeon X5355 processor for various compiler and MPI configurations. For this paper, I ran a series of similar tests using four Xeon E5410 processors in essentially the same hardware as the previous tests (the motherboard was upgraded to use an S5000 Blackford chip set that supports the 5400 series Xeon processor). First, I compared single-core performance by compiling all the NAS tests to run as a single process on one core (no MPI is used). Table One illustrates the dramatic performance improvement over previous tests. Several aspects of these results are worth noting. First, improvements in the GNU compilers (gfortran in particular) probably contributed to some, but not all, of the increased performance. Second, the current tests used 2.33 GHz processors while the previous tests used faster-clocked 2.66 GHz processors. Finally, not all tests showed improvements. In particular, EP (Embarrassingly Parallel) slowed down substantially, which I attribute to an optimization issue. I chose to leave EP out of the analysis for this reason. The IS (Integer Sort) benchmark showed no improvement, possibly because it is the only integer-based program in the test suite.

Test	Xeon X5355 (Clovertown) 2.66 GHz, GNU gcc/g77 3.4.5, options -O2 -ffast-math, one core (MFLOPS)	Xeon E5410 (Harpertown) 2.33 GHz, GNU gcc/gfortran 4.1.2, options -O2 -ffast-math, one core (MFLOPS)	Improvement (ratio)
BT	506	832	1.65
CG	309	333	1.08
FT	656	993	1.51
IS	39	38	0.99
LU	611	940	1.54
MG	589	1098	1.87
SP	465	685	1.47

Table One: Xeon X5355 (Clovertown) vs. Xeon E5410 (Harpertown) on a single core

Because single-core benchmarks for multi-core processors can be misleading, I also compared the full 16-core MPI results for the benchmark suite. Table Two provides these results. All the benchmarks improved by at least 42% (a ratio of 1.42) and one test (IS) almost doubled in performance.

Test	Xeon X5355 (Clovertown) 2.66 GHz, Mellanox IB, 16 cores (MFLOPS)	Xeon E5410 (Harpertown) 2.33 GHz, Mellanox IB, 16 cores (MFLOPS)	Improvement (ratio)
BT	2763	4031	1.46
CG	1092	1742	1.59
FT	2727	4804	1.76
IS	149	288	1.94
LU	8026	11405	1.42
MG	3742	5648	1.51
SP	2711	4858	1.79

Table Two: Xeon X5355 (Clovertown) vs. Xeon E5410 (Harpertown) over sixteen cores

In general, the results presented here show outstanding improvements in performance over previous generation hardware. It should be emphasized that the current hardware runs at a clock rate about 12% lower than the previous hardware (2.33 GHz vs. 2.66 GHz). Even with this clock rate handicap, the new E5410 Xeon ran an amazing 42-94% faster than previously reported results. In addition, increased scalability was noted for almost all cases, resulting in better utilization of each processor.
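To make the clock rate handicap concrete, here is a rough sketch that recomputes the Improvement column from the Table Two rates and also divides each rate by its processor's clock. The rates and clock speeds are copied from Table Two; the per-clock ratio is my own illustrative metric, not something taken from the white paper, and the function names are assumptions.

```python
# Rates copied from Table Two: (X5355 Clovertown, E5410 Harpertown), 16 cores.
# All values are MFLOPS except IS, which is the integer benchmark.
TABLE_TWO = {
    "BT": (2763, 4031),
    "CG": (1092, 1742),
    "FT": (2727, 4804),
    "IS": (149, 288),
    "LU": (8026, 11405),
    "MG": (3742, 5648),
    "SP": (2711, 4858),
}

OLD_GHZ = 2.66  # Clovertown clock, from the table header
NEW_GHZ = 2.33  # Harpertown clock, from the table header


def improvement(old_rate, new_rate):
    """Raw ratio of new to old throughput (the Improvement column)."""
    return new_rate / old_rate


def per_clock_improvement(old_rate, new_rate, old_ghz=OLD_GHZ, new_ghz=NEW_GHZ):
    """Ratio of throughput per GHz (illustrative metric, not from the paper)."""
    return (new_rate / new_ghz) / (old_rate / old_ghz)


if __name__ == "__main__":
    for test, (old, new) in TABLE_TWO.items():
        print(f"{test}: raw {improvement(old, new):.2f}x, "
              f"per clock {per_clock_improvement(old, new):.2f}x")
```

Because the rates in the table are rounded, a recomputed ratio can differ from the published Improvement column in the last digit. The per-clock figure is always larger than the raw ratio here, since the newer processor runs at the lower clock.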

Full results and some MPI optimization strategies are available in the Appro white paper. I should note that you must provide a small bit of information to Appro before you can download the paper. That is not a bad bargain, considering this type of data is hard to find. Indeed, as white papers go, it has plenty of real, unbiased data on MPI over multi-core. The good news is I still have the hardware, and more tests will be forthcoming.

Douglas Eadline is the editor of ClusterMonkey.