[Beowulf] ECC exerciser/exorciser?
prentice at ias.edu
Mon Jan 26 11:00:54 EST 2009
Mark Hahn wrote:
> Hi all,
> we're having some trouble with nodes showing high ECC corrected error (CE)
> counts. I'm wondering whether you have any wisdom on the following:
> - first, how would you go about setting a threshold for how high is an
> acceptable CE count? we by default are using the mce module, which by
> default polls at 1Hz. my thinking is that if we get overflow events
> (the multiple error bit is set), then it's too fast.
> - do you have or know of a good exerciser for testing ECC's? yes, I
> know about memtest86, but I'm more curious about a load that could be
> run under Linux. my thinking is that ecc's are triggered by bad reads,
> so something which allocates all memory and then continually reads it
> would be best.
I find that just running a large HPL job across the cluster will find
errors. It may take a couple of days, but it will. I've run breakin for
days on end without finding any memory errors, but when I run a
full-blown HPL job, I find memory errors right away (if right away = a
couple of days).
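To watch corrected-error counts while such a job runs, something like the following Python sketch could be used. It assumes the kernel's EDAC driver is loaded and exposes per-controller counters in /sys/devices/system/edac/mc/mc*/ce_count, which is a different interface from the mce module Mark mentions. The function names are made up for illustration, and what delta counts as "too high" is left to you.

```python
import glob


def read_ce_counts(root="/sys/devices/system/edac/mc"):
    """Return {path: corrected-error count} for each memory controller.

    Assumes the EDAC sysfs layout; returns an empty dict if it is absent.
    """
    counts = {}
    for path in sorted(glob.glob(f"{root}/mc*/ce_count")):
        with open(path) as f:
            counts[path] = int(f.read().strip())
    return counts


def ce_deltas(prev, curr):
    """Return only the counters that increased between two samples."""
    deltas = {}
    for key, value in curr.items():
        increase = value - prev.get(key, 0)
        if increase > 0:
            deltas[key] = increase
    return deltas
```

Calling read_ce_counts() before and after a run (or once a minute during it) and passing the two samples to ce_deltas() shows which controllers logged new corrected errors.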
Breakin runs xhpl on every core, but I'm not sure whether it's MPI-based
or whether every core is running an independent job. Maybe the breakin
developer(s) can chime in on how it stresses the RAM.
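For comparison, the kind of load Mark describes (allocate memory, then keep reading it) can be sketched in a few lines of Python. This is only an illustration, not how breakin or HPL works: the names CHUNK, fill and read_pass are made up, and the buffer size is whatever you pass in. Corrected errors are fixed by the hardware before software sees the data, so the comparison will normally report 0. The point of the loop is to force reads, with the CE counters checked separately.

```python
CHUNK = 1 << 20  # 1 MiB per comparison; hypothetical choice


def fill(total_bytes, pattern=b"\xaa\x55"):
    """Allocate a buffer (rounded down to whole chunks) and fill it.

    len(pattern) must divide CHUNK evenly; the default 2-byte pattern does.
    """
    chunk = pattern * (CHUNK // len(pattern))
    n_chunks = total_bytes // CHUNK
    buf = bytearray(n_chunks * CHUNK)
    for i in range(n_chunks):
        buf[i * CHUNK:(i + 1) * CHUNK] = chunk
    return buf, chunk


def read_pass(buf, chunk):
    """Read the whole buffer once; return the number of mismatching chunks."""
    bad = 0
    with memoryview(buf) as view:
        for off in range(0, len(buf), len(chunk)):
            if view[off:off + len(chunk)] != chunk:
                bad += 1
    return bad
```

To be of any use the buffer has to be much larger than the CPU caches, and read_pass would be called in a loop for hours, one process per core or per NUMA node.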
Hope that helps.