[Beowulf] bizarre scaling behavior on a Nehalem

Craig Tierney Craig.Tierney at noaa.gov
Wed Aug 12 14:02:15 EDT 2009

Rahul Nabar wrote:
> On Wed, Aug 12, 2009 at 11:32 AM, Craig Tierney<Craig.Tierney at noaa.gov> wrote:
>> What do you mean normally?  I am running CentOS 5.3 with 2.6.18-128.2.1
>> right now on a 448 node Nehalem cluster.  I am so far happy with how things work.
>> The original CentOS 5.3 kernel, 2.6.18-128.1.10, had bugs in Nehalem support
>> where nodes would just randomly start running slow.  Upgrading the kernel
>> fixed that.  But that performance problem was all or nothing; I don't recall
>> it exhibiting itself in the way that Rahul described.
> For me it shows:
> Linux version 2.6.18-128.el5 (mockbuild at builder10.centos.org)
> I am a bit confused by the numbering scheme now. Is this older or
> newer than Craig's? You are right, Craig; I haven't noticed any random
> slowdowns, but my data is statistically sparse. I only have a single
> Nehalem+CentOS test node right now.

When you run uname -a, you don't get something like this:

[ctierney at wfe7 serial]$ uname -a
Linux wfe7 2.6.18-128.2.1.el5 #1 SMP Thu Aug 6 02:00:18 GMT 2009 x86_64 x86_64 x86_64 GNU/Linux

We did build our kernel from source, but only because we ripped out the
in-kernel IB support so we could build against the latest OFED stack.


Also run:

# rpm -qa | grep kernel

and see what kernel versions are listed.

We have found a few performance problems so far.

1) Nodes would start going slow, really slow.  However, once they started
to go slow they stayed slow, and the problem was only cleared by a reboot.
This problem was resolved by upgrading to the kernel we use now.

2) Nodes are reporting too many System Events that look like single-bit
errors.  This again would show up as nodes that would start to go slow and
wouldn't recover until a reboot.  We no longer think we had lots of bad
memory, and the latest BIOS may have fixed it.  We are uploading that BIOS
now and will start checking.
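
In case anyone wants to check their own nodes for the same thing, something
like the following should show it (assuming ipmitool and the kernel EDAC
driver are available; the exact log wording and sysfs paths vary by vendor
and kernel):

# Dump the BMC's System Event Log and look for memory/ECC events;
# correctable single-bit errors usually show up here.
ipmitool sel elist | grep -i -e ecc -e memory

# If the EDAC driver is loaded, per-DIMM correctable-error counters
# are also exposed in sysfs (csrow layout on RHEL5-era kernels).
grep . /sys/devices/system/edac/mc/mc*/csrow*/ce_count 2>/dev/null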

The only time I was getting variability in timings was when I wasn't pinning
processes and memory correctly.  My tests have always used all the cores
in a node, though.  I think that OpenMPI does the right thing with
mpi_paffinity_alone.  For MVAPICH, we wrote a wrapper script (similar to
TACC's) that uses numactl directly to pin memory and processes.
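
In case it is useful, here is a rough sketch of that kind of wrapper (not
our actual script; it assumes MVAPICH2 exports MV2_COMM_WORLD_LOCAL_RANK
and two quad-core sockets per node, so adjust the core/NUMA math for your
topology):

#!/bin/bash
# numa_pin.sh -- hypothetical wrapper: pin each local MPI rank to one
# core and force its memory allocations onto the local NUMA node.
# Usage: mpirun_rsh -np <N> -hostfile hosts ./numa_pin.sh ./a.out [args]

# Local rank of this process on the node (variable name depends on the
# MPI library and version).
lrank=${MV2_COMM_WORLD_LOCAL_RANK:-0}

# With 8 cores per node split 0-3 / 4-7 across the two sockets:
core=$lrank
mem=$(( lrank / 4 ))

# Bind to the chosen core and allocate memory only from its local node.
exec numactl --physcpubind=$core --membind=$mem "$@"

The same idea works with OpenMPI by keying off OMPI_COMM_WORLD_LOCAL_RANK
instead, though mpi_paffinity_alone already covers the simple
one-rank-per-core case.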


Craig Tierney (craig.tierney at noaa.gov)