dual Xeon issue
Thomas Alrutz
Thomas.Alrutz at dlr.de
Fri Jun 6 07:28:24 EDT 2003
> Sorry for a noob-like question. We are in the process of buying a Xeon-based cluster
> for scientific computing (running MPI parallelized numerical models).
> We've benchmarked clusters offered by Microway and Aspen Systems and discovered that we
> are getting much better results using only one CPU per box (running the model on 6 boxes
> using one CPU in each box was about 50% faster than running it on 3 boxes using both CPUs).
> This was surprising to us since we used to be limited by interprocessor communications,
> which should be a lot faster between CPUs in one box. Can anybody explain the reason for
> this and, more importantly, is there any way to improve this situation?
>
> Thanks,
>
> Sergei.
>
The so-called "poor performance" in parallel mode when using 2 CPUs per
node (box) comes from the fact that memory bandwidth is limited.
If you employ both CPUs, the 2 jobs share the memory interface; if you
employ only 1 CPU and no other job is running on the node, that job has
the full memory bandwidth to itself.
This is why the second CPU yields a performance gain of only about 50%.
Our benchmarks using the TAU-code, an unstructured Navier-Stokes solver,
showed the same behavior.
We used a wing-body-engine aircraft configuration with 2 million grid
points for the benchmarks and employed a full multigrid cycle to test
the communication (here MPI calls for a domain-decomposition model).
The performance gain (faster main-loop time per iteration) that we
got from the 2nd CPU was:
  Athlon MP    FSB 133   1.6 GHz   72%
  Xeon Rambus  FSB 100   2.0 GHz   55%
  Xeon DDR     FSB 100   2.4 GHz   47%
  Xeon DDR     FSB 100   2.8 GHz   43%
  Xeon DDR     FSB 133   2.4 GHz   50%
These are the values with 1 node in use. We observed a further decrease
of the gain from the 2nd CPU when more than 1 node was used for the
calculation (e.g. Xeon 2.4 GHz FSB 100 on 8 nodes: only 37% for the
2nd CPU).
So if you are using a code that needs a lot of memory transfers, you
have to decide whether the performance gain is worth the cost of the
2nd CPU!
But if you are looking for good benchmark results, you might try an
Opteron system (NUMA instead of a shared bus).
Thomas
--
__/|__ | Dipl.-Math. Thomas Alrutz
/_/_/_/ | DLR Institute of Aerodynamics and Flow Technology
|/ | Numerical Methods
DLR | Bunsenstr. 10
| D-37073 Goettingen/Germany
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf