dual Xeon issue

Thomas Alrutz Thomas.Alrutz at dlr.de
Fri Jun 6 07:28:24 EDT 2003

> Sorry for a noob-like question. We are in the process of buying a Xeon-based
> cluster for scientific computing (running MPI-parallelized numerical models).
> We've benchmarked clusters offered by Microway and Aspen Systems and discovered
> that we are getting much better results using only one CPU per box (running
> the model on 6 boxes using one CPU in each box was about 50% faster than
> running it on 3 boxes using both CPUs). This was surprising to us since we
> used to be limited by interprocessor communications, which should be a lot
> faster between CPUs in one box. Can anybody explain the reason for this and,
> more importantly, is there any way to improve this situation?
> Thanks,
> Sergei.

The so-called "poor performance" in parallel mode when using 2 CPUs per
node (box) comes from the fact that memory bandwidth is limited.
If you employ both CPUs, the two jobs share the memory interface;
if you employ only one CPU and no other job is running on the node,
that job has the full memory bandwidth exclusively.
This is why the second CPU yields a performance gain of only about 50%.

Our benchmarks using the unstructured Navier-Stokes TAU-code solver for
a numerical calculation showed the same behavior.
We used a wing-body-engine aircraft configuration with 2 million grid
points for the benchmarks and employed a full multigrid cycle to test
the communication (here MPI calls for a domain-decomposition model).

The performance gain (faster main-loop time per iteration) that we
got from the 2nd CPU is:
Athlon MP   FSB 133  1.6 GHz  72%
Xeon Rambus FSB 100  2.0 GHz  55%
Xeon DDR    FSB 100  2.4 GHz  47%
Xeon DDR    FSB 100  2.8 GHz  43%
Xeon DDR    FSB 133  2.4 GHz  50%

These are the values for 1 node in use. We observed a further decrease
in the performance gain from the 2nd CPU when more than 1 node was used
for the calculation (e.g. Xeon 2.4 GHz FSB 100 on 8 nodes: only 37% for
the 2nd CPU).

So if you are using a code that needs a lot of memory transfers, you
have to decide whether the performance gain is worth the cost of the
2nd CPU!

But if you are looking for good benchmark results, you might try an
Opteron system (NUMA instead of a shared bus).

  __/|__ | Dipl.-Math. Thomas Alrutz
/_/_/_/ | DLR Institute of Aerodynamics and Flow Technology
   |/    | Numerical Methods
     DLR | Bunsenstr. 10
         | D-37073 Goettingen/Germany

Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
