[Beowulf] Memory Testing?
hahn at mcmaster.ca
Sat Aug 13 22:22:52 EDT 2011
> I'm curious if anyone has any experience with ECC uncorrectable errors
> (specifically not the identification of), but which specific dimm in
> the chassis it's pointing to.
we've had good luck using EDAC to pin down bad dimms -
at least those that cause _correctable_ errors.
our uncorrectable errors trigger panics, though I guess that's
selectable and you could turn it off (/sys/module/edac_mc/panic_on_ue).
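
A rough sketch of how the per-dimm correctable counts can be pulled out of
EDAC. It assumes the csrow-style sysfs layout under
/sys/devices/system/edac/mc (mcN/csrowN/chN_ce_count); the helper names are
made up for illustration, and newer kernels may expose a different layout,
so treat it as a starting point rather than a finished tool.

```python
from pathlib import Path

# Usual EDAC sysfs location; the csrow-style layout below is an assumption.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_int(path):
    """Return the integer stored in a sysfs file, or None if unreadable."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def correctable_errors(root=EDAC_ROOT):
    """List (controller, csrow, channel, ce_count) for channels with errors."""
    hits = []
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        for csrow in sorted(mc.glob("csrow[0-9]*")):
            for counter in sorted(csrow.glob("ch[0-9]*_ce_count")):
                count = read_int(counter)
                if count:  # skip zero counts and unreadable files
                    channel = counter.name.split("_")[0]
                    hits.append((mc.name, csrow.name, channel, count))
    return hits


if __name__ == "__main__":
    for mc, csrow, channel, count in correctable_errors():
        print(f"{mc} {csrow} {channel}: {count} correctable errors")
```

Run from cron on each node, any non-empty output gives the
(controller, csrow, channel) triple to chase down.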
> The mcelog in linux doesn't seem to report the dimm slot correctly on
> my supermicro boards.
I prefer the hardware-topology-based naming that edac uses
(controller, channel, chipselect). I guess recent versions of edac
have a user-space tool that will translate that for you (but of course,
you have to verify the topo-to-label mapping yourself anyway.)
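
For what it's worth, the topo-to-label mapping can just be a small table you
keep per board model. A minimal sketch follows; the slot names in it are
invented for illustration (not taken from any real Supermicro board), and
each entry would have to be verified by hand, e.g. by pulling a dimm and
seeing which entry disappears.

```python
# HYPOTHETICAL table: keys are (controller, channel, chipselect) and the
# slot names are placeholders. Fill it in from your own verification.
LABELS = {
    (0, 0, 0): "P1-DIMM1A",
    (0, 1, 0): "P1-DIMM2A",
    (1, 0, 0): "P2-DIMM1A",
    (1, 1, 0): "P2-DIMM2A",
}


def slot_label(controller, channel, chipselect, labels=LABELS):
    """Translate an EDAC topology triple into a board slot label.

    Unknown triples are reported as unverified instead of being guessed.
    """
    key = (controller, channel, chipselect)
    return labels.get(key, "unverified mc{} ch{} csrow{}".format(*key))


if __name__ == "__main__":
    print(slot_label(0, 1, 0))
    print(slot_label(3, 1, 2))
```

Falling back to the raw triple for unknown entries keeps the hardware
naming visible until the mapping has actually been checked.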
regards, mark hahn.