[Beowulf] Memory errors poll

Mark Hahn hahn at mcmaster.ca
Tue Mar 31 00:14:06 EDT 2009

>> we replace dimms which show > 1000 corrected ECCs per day
>> (or any overflows, for which counts are inaccurate, or any uncorrectable
>> errors.)
> These systems are a couple of generations old, right?

waaait a minute - I think I gave the wrong impression.  we have about
13 TB of this generation of hardware (yes, from 3 years ago).  our observed
rate is that at any given moment, a fraction of 1% of the nodes have any
corrected ECCs at all.  our vendor is happy to replace dimms that have a
nontrivial rate, and there aren't a lot of nodes that have had this done.

one interesting thing is that over a 3-year period, it seems like about 1%
of nodes developed higher corrected-ECC rates that disappeared when the
dimms were reseated.  I wonder whether this was the result of thermal
cycling...

> I think I have Linux set up to record single-bit errors, and the rate

using edac?  I toyed with mcelog before that, but never really got much
traction until edac came with an updated kernel.
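
for anyone who wants to try the same kind of check, here is a minimal sketch
(not our production script).  it assumes the EDAC driver is loaded and exposes
the usual per-controller counters under /sys/devices/system/edac/mc/mc*/
(ce_count for corrected errors, ue_count for uncorrectable ones).  the
function names, the sampling interval and the overflow_seen flag are all
hypothetical: how you detect an overflow depends on your hardware and logging,
so the caller has to supply that.

```python
from pathlib import Path

# Standard EDAC sysfs location; adjust if your kernel exposes it elsewhere.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")
# Threshold quoted in this thread: more than 1000 corrected ECCs per day.
CE_PER_DAY_LIMIT = 1000


def read_counter(path):
    """Return the integer stored in a sysfs counter file, or 0 if unreadable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return 0


def read_edac_counts(root=EDAC_ROOT):
    """Map each memory controller (mc0, mc1, ...) to its (ce, ue) totals."""
    counts = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        counts[mc.name] = (read_counter(mc / "ce_count"),
                           read_counter(mc / "ue_count"))
    return counts


def should_replace(ce_before, ce_after, ue_after, hours_elapsed,
                   overflow_seen=False):
    """Apply the replacement policy described in this thread.

    Flags a dimm on any uncorrectable error, on any reported overflow
    (hypothetical flag supplied by the caller), or when the corrected-error
    rate between two samples exceeds CE_PER_DAY_LIMIT per day.
    """
    if hours_elapsed <= 0:
        raise ValueError("hours_elapsed must be positive")
    if ue_after > 0 or overflow_seen:
        return True
    ce_per_day = (ce_after - ce_before) * 24.0 / hours_elapsed
    return ce_per_day > CE_PER_DAY_LIMIT
```

the counters are cumulative, so the idea is to call read_edac_counts() twice
some hours apart and feed the two ce values for a controller into
should_replace().  a reboot or driver reload resets them, which this sketch
does not try to detect.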