# [Beowulf] Infiniband and multi-cpu configuration

Craig Tierney Craig.Tierney at noaa.gov
Mon Feb 11 10:10:36 EST 2008

```Guillaume Michal wrote:
> Hi all,
> We set up our first cluster in our faculty this week. As we are new to cluster computing, there is a lot to learn. We performed
> some Linpack tests using the OpenMPI benchmark available in the Rocks 4.3 distribution. The system is as follows:
>  - Gigabit Ethernet with an HP ProCurve 2800 series switch
>  - 1 master node: 500GB SATA HDD, two Intel quad-core E5410 at 2.33GHz, 2GB mem
>  - 4 nodes, each having: 80GB SATA HDD, two Intel quad-core E5410 at 2.33GHz, 8GB mem
>
> First, I'm a bit confused by the parameters P and Q in HPL.dat and how to use them properly. I noticed a 4P 2Q test is not
> equivalent to a 2P 4Q; generally speaking, it does not commute. Why? What exactly are P and Q, then: P for the number of processors per
> node and Q for the number of nodes?
>

Visualize the problem as a big 2D matrix.  P and Q represent how the problem
is divided among the processes (a P x Q grid).  In general, the best results
come when the matrix is divided into even squares.  If your core count isn't
n^2, then P and Q have to be different.  From experience, P should always be
less than Q.  There may be a computational reason for that (i.e., longer
strides in memory), but I am not sure.

> Secondly, what is the definition of processor for a quad core architecture? I suppose a quad core should be counted as 4 processors.

Yes, unless you are using a multithreaded BLAS library.  If you are,
you should have each node be 1 process.

>
> I launched Linpack using Ns=10000 and various configurations for P and Q. At the moment I got a maximum of 78 Gflops using P=8 Q=4
> -> 32 processors.

You want to use as much available memory as possible.  I use N=10000 on a
single processor, single core run with 1GB.   You can figure out a good
value of N by the following formula:

Ns=sqrt(<Memory in Bytes per core>*<Number of cores>/8)

The 8 represents the size of a double.  For <Memory in Bytes per core>, I try
to use the largest number possible, typically about 90% of max.  You never
want to go into swap during these calculations (or, have it crash because
you have diskless nodes).

Ex: If you have 2GB per core for 32p (using about 1900MB of it), you should use Ns as:

Ns=sqrt(1900*1024*1024*32/8)
Ns=89270

Honestly, this may be overkill.  At some point, the working memory set will
be large enough so that FP performance will be the bottleneck.  I would
experiment with a few values to see what is going on.  In any case, using
Ns=10000 is way too small.

>
> If I'm right, the peak performance should be Rpeak = 4 cores x 4 floating point ops per cycle x 2.33 GHz x 8 quad cores = 298 Gflops.
> Which would lead to a test running at ~25% Rpeak.
>
> This is very low and I see 3 causes for the problem:
>     - I miscalculated Rpeak
>     - P and Q are not set properly
>     - there is a serious bottleneck
>

I think your Rpeak calculation is correct (not sure how many FPs the latest
Intel chips can do).

If increasing Ns doesn't help, run smaller cases on a per-node basis (using
all available memory for each node).  If you don't get the exact same
answer on every node (or at least within 2%), you have a problem.  Figure out
what is wrong with the slow nodes.  Also, run the test multiple times
on the same node and verify consistent performance.

Craig

>
> Guillaume
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
>

--
Craig Tierney (craig.tierney at noaa.gov)
```
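
The arithmetic in the reply above can be checked with a small Python sketch. It is only an illustration, not part of HPL: the function names (`hpl_problem_size`, `rpeak_gflops`, `process_grids`) are hypothetical helpers introduced here, the figures are the ones quoted in the thread (1900MB usable per core, 32 cores, 4 flops per cycle at 2.33 GHz, 78 Gflops measured), and the grid ordering simply encodes the rule of thumb from the reply (near-square, with P no larger than Q).

```python
import math

BYTES_PER_DOUBLE = 8  # the "8" in the formula from the reply


def hpl_problem_size(mem_bytes_per_core, cores):
    """Ns = sqrt(<memory in bytes per core> * <number of cores> / 8), rounded down."""
    total_doubles = int(mem_bytes_per_core * cores) // BYTES_PER_DOUBLE
    return math.isqrt(total_doubles)


def rpeak_gflops(cores, flops_per_cycle, clock_ghz):
    """Theoretical peak: cores x floating point ops per cycle x clock (GHz)."""
    return cores * flops_per_cycle * clock_ghz


def process_grids(nprocs):
    """All (P, Q) with P * Q == nprocs and P <= Q, most square grid first."""
    grids = [(p, nprocs // p)
             for p in range(1, math.isqrt(nprocs) + 1)
             if nprocs % p == 0]
    return sorted(grids, key=lambda pq: pq[1] - pq[0])


# Figures taken from the thread; 1900MB is the usable share of 2GB per core.
ns = hpl_problem_size(1900 * 1024 * 1024, 32)
print("Ns =", ns)  # Ns = 89270

rpeak = rpeak_gflops(cores=32, flops_per_cycle=4, clock_ghz=2.33)
print(f"Rpeak = {rpeak:.0f} Gflops, efficiency at 78 Gflops = {78 / rpeak:.0%}")

print("candidate grids:", process_grids(32))  # [(4, 8), (2, 16), (1, 32)]
```

For 32 processes the most square grid with P below Q is 4 x 8, the reverse of the P=8 Q=4 run reported in the question, so that would be the first configuration to retry once Ns has been increased.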