[Beowulf] Stress / torture test cluster hardware

Andrew Shewmaker agshew at gmail.com
Sun Oct 8 00:26:52 EDT 2006

On 10/7/06, Nico Mittenzwey <nico.mittenzwey at s2001.tu-chemnitz.de> wrote:

> "memtest86" http://www.memtest86.com/

If you are using a large amount of ECC memory, you may find it
necessary to keep track of Single Bit Errors and look for "weak"
DIMMs using something like the EDAC/bluesmoke drivers
(http://bluesmoke.sourceforge.net) and a userspace memory
tester.

On a 264-node cluster with 8-16GB RAM, I had to weed out
weak memory over a period of months.  A given node
running a memory tester would show no SBEs for a day or
more, then suddenly show a huge burst.  Another system
might have a more consistently incrementing SBE counter.
Now, ECC was working, so applications like the memory
tester weren't having problems.  However, I couldn't reliably
reboot this cluster because the BIOS would often refuse to
boot unless a node was powered off for say, five minutes.
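
The two patterns described above (a long quiet period followed by a
burst, versus a steadily climbing counter) can be told apart from
periodic readings of a node's SBE counter.  What follows is only a
rough sketch: the function name classify_sbe_history and the
burst_share threshold of 0.8 are made up for illustration and are not
something the cluster above used.

```python
def classify_sbe_history(samples, burst_share=0.8):
    """Label a series of cumulative SBE counter readings.

    samples: counter values taken at regular intervals, oldest first.
    burst_share: hypothetical threshold; if one interval holds at least
    this fraction of all new errors, the history is called a burst.
    """
    if len(samples) < 2:
        return "clean"

    deltas = []
    for prev, cur in zip(samples, samples[1:]):
        # A counter that went backwards was probably reset (for example
        # by a reboot), so count the new reading as the increase.
        deltas.append(cur - prev if cur >= prev else cur)

    total = sum(deltas)
    if total == 0:
        return "clean"
    if max(deltas) >= burst_share * total:
        return "burst"
    return "steady"


if __name__ == "__main__":
    print(classify_sbe_history([0, 0, 0, 500, 500]))  # quiet, then a spike
    print(classify_sbe_history([0, 3, 7, 10, 14]))    # slow, regular climb
```

Either label marks a DIMM worth watching; the point of the split is
only to show that a single day of clean readings does not clear a node.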

I wrote up more of this experience on the Real World Tech


It looks like the latest version of Stresslinux has a
kernel, so it should have the EDAC drivers included.  Plus,
it has the userspace memtester.  Memtest86 is nice, but it
didn't support checking the ECC counters on the cluster
I mention above.  It couldn't help me weed out DIMMs at

See http://agenda.clustermonkey.net/index.php/Memory
for some more info about this (links to LWN articles and a
list of supported drivers in 2.6.16).
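
With an EDAC driver loaded, the correctable error counters show up in
sysfs, so a small script can collect them per chip-select row.  This is
a minimal sketch, assuming the usual /sys/devices/system/edac/mc layout
with mcN/csrowN/ce_count files; the function name read_ce_counts is
hypothetical, and which rows appear depends on the chipset driver.

```python
import glob
import os

# Assumed location of the EDAC memory controller tree in sysfs.
EDAC_MC_ROOT = "/sys/devices/system/edac/mc"


def read_ce_counts(root=EDAC_MC_ROOT):
    """Return a dict like {"mc0/csrow2": 17} of correctable error counts."""
    counts = {}
    pattern = os.path.join(root, "mc*", "csrow*", "ce_count")
    for path in sorted(glob.glob(pattern)):
        # Label each counter by its controller and row, e.g. "mc0/csrow2".
        label = os.path.relpath(os.path.dirname(path), root)
        try:
            with open(path) as f:
                counts[label] = int(f.read().strip())
        except (OSError, ValueError):
            # Skip rows that cannot be read or parsed rather than abort.
            continue
    return counts


if __name__ == "__main__":
    for label, count in read_ce_counts().items():
        print(f"{label}: {count} correctable errors")
```

Run from cron on each node and logged with a timestamp, output like this
gives a per-row history that can be compared across the cluster.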

I wasn't aware of the EDAC wiki until I saw it linked
from the bluesmoke page just now.  It will tell you
what chipset support is coming.


I would be interested to hear what kind of
single bit error rates other people see on their clusters.

Andrew Shewmaker