[Beowulf] Memory Testing?

Mark Hahn hahn at mcmaster.ca
Sat Aug 13 22:22:52 EDT 2011


> I'm curious if anyone has any experience with ECC uncorrectable errors
> (specifically not the identification of), but which specific dimm in
> the chassis it's pointing to.

we've had good luck using EDAC to pin down bad dimms -
at least those that cause _correctable_ errors.
our uncorrectable errors trigger panics.  that's selectable, though:
I guess you could turn it off (/sys/module/edac_mc/panic_on_ue)
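
as a rough sketch of what "pinning down" looks like in practice, the
script below walks the legacy per-csrow EDAC sysfs tree and lists every
channel with a nonzero correctable-error count.  it assumes the driver
exposes /sys/devices/system/edac/mc/mcN/csrowN/chN_ce_count (and
chN_dimm_label); the function names are mine, and newer kernels may lay
the tree out differently.

```python
import glob
import os
import re

# Assumed location of the legacy csrow-based EDAC tree; adjust if your kernel differs.
EDAC_ROOT = "/sys/devices/system/edac/mc"


def read_text(path, default=""):
    """Return the stripped contents of a sysfs file, or `default` if unreadable."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return default


def correctable_errors(root=EDAC_ROOT):
    """Return [(mc, csrow, channel, ce_count, label)] for channels with ce_count > 0."""
    hits = []
    pattern = os.path.join(root, "mc*", "csrow*", "ch*_ce_count")
    for path in sorted(glob.glob(pattern)):
        m = re.search(r"mc(\d+)/csrow(\d+)/ch(\d+)_ce_count$", path.replace(os.sep, "/"))
        if not m:
            continue
        mc, csrow, ch = (int(x) for x in m.groups())
        try:
            count = int(read_text(path, "0"))
        except ValueError:
            continue  # skip anything that is not a plain integer counter
        if count > 0:
            label_path = os.path.join(os.path.dirname(path), f"ch{ch}_dimm_label")
            hits.append((mc, csrow, ch, count, read_text(label_path)))
    return hits


if __name__ == "__main__":
    for mc, csrow, ch, count, label in correctable_errors():
        print(f"mc{mc} csrow{csrow} ch{ch}: {count} CE  label={label or '?'}")
```

run it periodically (or from cron) and a dimm that keeps accumulating
correctable errors shows up as the same controller/csrow/channel triple
each time.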

> The mcelog in linux doesn't seem to report the dimm slot correctly on
> my supermicro boards.

I prefer the hardware-topology-based naming that edac uses
(controller, channel, chipselect).  I guess recent versions of edac
have a user-space tool that will translate that for you (but of course,
you have to verify the topo-to-label mapping yourself anyway.)
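
to illustrate the topo-to-label step, here is a minimal sketch of a
hand-verified lookup table keyed on (controller, chipselect row, channel).
everything in it is hypothetical: the table name, the function, and
especially the slot labels, which are placeholders and not a real
supermicro layout.  it deliberately reports unverified positions by
topology only, rather than guessing a slot.

```python
# Hypothetical mapping, filled in only after physically confirming each slot
# (for example by swapping a known-bad dimm and watching which counter moves).
# Keys are (controller, csrow, channel); labels are placeholders.
VERIFIED_LABELS = {
    (0, 0, 0): "P1-DIMM1A",
    (0, 0, 1): "P1-DIMM2A",
    (1, 0, 0): "P2-DIMM1A",
}


def slot_name(mc, csrow, channel, labels=VERIFIED_LABELS):
    """Translate an EDAC topology triple into a slot label, if one was verified."""
    topo = f"mc{mc}/csrow{csrow}/ch{channel}"
    label = labels.get((mc, csrow, channel))
    if label is None:
        # No verified entry: fall back to the topology name instead of guessing.
        return f"{topo} (unverified)"
    return f"{label} [{topo}]"


if __name__ == "__main__":
    print(slot_name(0, 0, 1))
    print(slot_name(3, 1, 0))
```

keeping the topology string in the output means the label can always be
cross-checked against what edac itself reported.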

regards, mark hahn.