Last fall, I had an opportunity to test the new quad-core Intel® Xeon® 5400
(Harpertown) processor. As the Harpertown was in short supply, I only
had two nodes (two sockets on each node for a total of 16 cores) with which to run my tests.
I'm not picky, however.
I like to run HPC tests because this type of data is absent from the
mainstream press. In particular, I am most interested in how multi-core and MPI
play in the HPC space. There is plenty to discuss, but for some interesting benchmark numbers, read on.
The hardware was provided by Appro as part of a white
paper I was writing for SC07 (SuperComputing 2007).
After some delays, like converting my temporary basement office to a
real office with walls, I have finally managed to publish some of the
more interesting findings from my tests. But first, a little background may be useful.
In 2006, I wrote a similar white paper
using the Clovertown processor. Again, this paper was for the annual SC
(SuperComputing 2006) show. When I originally ran tests on the Clovertown
platform, I was not that impressed. Some codes did well, but with others, all the extra cores did not make that much difference (i.e., programs running across 8 cores on a single node only saw the performance of 2-3 effective cores).
When I received the Harpertown systems, I was not sure what to
expect.
My main goal for the tests was to provide performance data for MPI
applications. After all, that is what HPC users really care about.
Officially, I was using a Xeon E5410 (Harpertown) at 2.33 GHz. In my
previous paper, I had used a higher-clocked Xeon X5355 (Clovertown) at 2.66 GHz.
Comparing the new and old processors was a secondary goal for the tests (if you consider
a processor that has been on the market one year as old). In
addition, I was interested in determining some MPI multi-core
guidelines for HPC users. I should note that my goal was not to
optimize a specific benchmark.
The hardware consisted of the following: two Appro HyperServer nodes
connected by Mellanox®
InfiniBand HCAs (model MHEA28-XTC). Each node has a two-socket Intel
motherboard (S5000PALR), 8 GBytes of FBDIMM RAM, and a single SATA
hard drive. Each motherboard also holds two 2.33 GHz quad-core Xeons
(E5410) for a total of eight cores per node.
The software environment is based on Fedora Core 6 with Linux
Kernel 2.6.22-12. The Fedora distribution was enhanced with the Basement
Supercomputing Baseline Cluster Suite. The benchmarks were
compiled with GNU gcc/gfortran version 4.2.1. Open MPI was chosen
because it has support for Mellanox InfiniBand and because it
supports a level of processor affinity not available in other open
source MPI packages.
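Processor affinity simply means pinning a process to a fixed set of cores so the scheduler cannot migrate it. As a rough illustration of the idea only (this is not how Open MPI implements or exposes it), the Python sketch below pins the current process to one core using the Linux-only `os.sched_setaffinity` call. The helper name `pin_to_core` is made up for this example.

```python
import os


def pin_to_core(core_id):
    """Pin the calling process to a single core and return the resulting mask.

    Hypothetical helper for illustration; relies on the Linux-only
    os.sched_setaffinity / os.sched_getaffinity calls.
    """
    os.sched_setaffinity(0, {core_id})  # pid 0 means "this process"
    return os.sched_getaffinity(0)


if __name__ == "__main__":
    if hasattr(os, "sched_setaffinity"):
        allowed = sorted(os.sched_getaffinity(0))
        print("cores available before pinning:", allowed)
        print("cores available after pinning: ", sorted(pin_to_core(allowed[0])))
    else:
        print("os.sched_setaffinity is not available on this platform")
```

With one MPI rank per core, pinning of this kind keeps each rank next to its cache contents, which is presumably why affinity support mattered for these multi-core tests.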
The NAS
Parallel Benchmark (NPB Version 2.3) suite was used as a
good overall test of system performance. The NAS tests are a small
set of programs designed to help evaluate the performance of parallel
supercomputers. The benchmarks, which are derived from computational
fluid dynamics (CFD) applications, consist of five kernels and three
pseudo-applications.
In addition to being widely used, the NAS suite is
self-checking and exhibits a range of program behaviors. All codes
were run three times and the results were averaged. Also note that results
are measured in MFLOPS (Million Floating Point Operations Per Second),
except for IS, which is measured in MIOPS (Million Integer
Operations Per Second).
The Numbers
As mentioned, in a previous white
paper I had measured the performance of the quad-core Xeon X5355
processor for various compiler and MPI configurations. For this paper,
I ran a series of similar tests using four Xeon E5410 processors in
essentially the same hardware as the previous tests (the motherboard
was upgraded to use an S5000 Blackford chipset that supports the 5400
series Xeon processor). First, I compared single-core
performance by compiling all the NAS tests to run as a single process
on one core (no MPI is used). Table One illustrates the
dramatic performance improvement over the previous tests. Several aspects
of these results are worth noting. First, improvements in the GNU
compilers (gfortran in particular) probably contributed to some, but
not all, of the increased performance. Second, the current tests used
2.33 GHz processors while the previous tests used faster-clocked 2.66
GHz processors. Finally, not all tests showed improvements. In
particular, EP (Embarrassingly Parallel) slowed down
substantially, which I attribute to an optimization issue. I chose
to leave EP out of the analysis for this reason. The IS (Integer
Sort) benchmark showed no improvement, possibly because it is the only
integer-based program in the test suite.
| Test | Xeon X5355 - Clovertown 2.66 GHz, GNU gcc/g77 3.4.5, Options= -O2 -ffast-math (MFLOPS, using one core) | Xeon E5410 - Harpertown 2.33 GHz, GNU gcc/gfortran 4.1.2, Options= -O2 -ffast-math (MFLOPS, using one core) | Improvement |
|------|------|------|------|
| BT | 506 | 832 | 1.65 |
| CG | 309 | 333 | 1.08 |
| FT | 656 | 993 | 1.51 |
| IS | 39 | 38 | 0.99 |
| LU | 611 | 940 | 1.54 |
| MG | 589 | 1098 | 1.87 |
| SP | 465 | 685 | 1.47 |

Table One: X5355 (Clovertown) vs Xeon E5410 (Harpertown) on a single core
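The Improvement column is just the Harpertown rate divided by the Clovertown rate. The short Python sketch below recomputes it from the rounded values in Table One; because the table entries are rounded, a recomputed ratio can differ from the published one in the last digit. The names `TABLE_ONE` and `improvement` are made up for this example.

```python
# Single-core rates from Table One: (X5355 Clovertown, E5410 Harpertown).
# Values are MFLOPS, except IS, which the text says is measured in MIOPS.
TABLE_ONE = {
    "BT": (506, 832),
    "CG": (309, 333),
    "FT": (656, 993),
    "IS": (39, 38),
    "LU": (611, 940),
    "MG": (589, 1098),
    "SP": (465, 685),
}


def improvement(old_rate, new_rate):
    """Return how many times faster the new result is than the old one."""
    return new_rate / old_rate


if __name__ == "__main__":
    for test, (clovertown, harpertown) in TABLE_ONE.items():
        print(f"{test}: {improvement(clovertown, harpertown):.2f}x")
```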
Because single-core benchmarks for
multi-core processors can be misleading, I also compared the full
16-core MPI results for the benchmark suite. Table Two provides
these results. All the benchmarks improved by at least 42% (1.42
times faster), and one test (IS) almost doubled in performance.
| Test | Xeon X5355 - Mellanox IB, Clovertown 2.66 GHz - 16 cores (MFLOPS) | Xeon E5410 - Mellanox IB, Harpertown 2.33 GHz - 16 cores (MFLOPS) | Improvement |
|------|------|------|------|
| BT | 2763 | 4031 | 1.46 |
| CG | 1092 | 1742 | 1.59 |
| FT | 2727 | 4804 | 1.76 |
| IS | 149 | 288 | 1.94 |
| LU | 8026 | 11405 | 1.42 |
| MG | 3742 | 5648 | 1.51 |
| SP | 2711 | 4858 | 1.79 |

Table Two: X5355 (Clovertown) vs Xeon E5410 (Harpertown) over sixteen cores
In general, the results presented here show outstanding
improvements in performance over previous generation hardware. It
should be emphasized that the current hardware has a clock rate about
12% lower than the previous hardware (2.33 GHz versus 2.66 GHz). Even with this clock rate
handicap, the new E5410 Xeon ran an amazing 42-94% faster than
previously reported results. In addition, increased scalability was
noted for almost all cases, resulting in better utilization of
each processor.
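Two quick calculations help put the clock-rate handicap and the scaling remark in context. The Python sketch below (a) divides each 16-core rate by its processor clock before taking the ratio, giving a per-GHz improvement, and (b) divides each 16-core rate by the single-core rate from Table One to estimate "effective cores". Both are rough proxies rather than the analysis used in the white paper: the single-core runs used no MPI and the Clovertown single-core numbers came from an older compiler, so the figures need not agree with the scalability results reported there. All function and variable names are made up for this example.

```python
# (single-core rate, 16-core rate) taken from Tables One and Two.
CLOVERTOWN = {"BT": (506, 2763), "CG": (309, 1092), "FT": (656, 2727),
              "IS": (39, 149), "LU": (611, 8026), "MG": (589, 3742),
              "SP": (465, 2711)}
HARPERTOWN = {"BT": (832, 4031), "CG": (333, 1742), "FT": (993, 4804),
              "IS": (38, 288), "LU": (940, 11405), "MG": (1098, 5648),
              "SP": (685, 4858)}
CLOVERTOWN_GHZ = 2.66
HARPERTOWN_GHZ = 2.33


def per_clock_improvement(old_rate, new_rate):
    """Improvement after dividing each rate by its processor clock."""
    return (new_rate / HARPERTOWN_GHZ) / (old_rate / CLOVERTOWN_GHZ)


def effective_cores(single_core_rate, multi_core_rate):
    """Multi-core rate expressed in units of single-core performance."""
    return multi_core_rate / single_core_rate


if __name__ == "__main__":
    for test in CLOVERTOWN:
        old_1, old_16 = CLOVERTOWN[test]
        new_1, new_16 = HARPERTOWN[test]
        print(f"{test}: per-clock gain "
              f"{per_clock_improvement(old_16, new_16):.2f}x, "
              f"effective cores {effective_cores(old_1, old_16):.1f} -> "
              f"{effective_cores(new_1, new_16):.1f}")
```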
Full results and some MPI
optimization strategies are available in the Appro white
paper. I should note that Appro asks for a small bit of information
before you can download the paper. Not a bad bargain, considering this type of data is
hard to find. Indeed, as white papers go, it has plenty of real, unbiased data on MPI over multi-core.
The good news is I still have the hardware, and more tests will
be forthcoming.
Douglas Eadline is the editor of ClusterMonkey.