Memory
Error Detection And Correction
A cluster of systems with large amounts of RAM provides system integrators and administrators with an opportunity to become familiar with Soft Errors.
According to arch/*/kernel/mce.c and arch/*/kernel/traps.c, Linux kernels older than 2.6.16 handle an uncorrectable bit error in one of two ways. Either the error is seen as a Machine Check Exception (MCE), in which case the kernel prints a message with the DIMM bank and panics; or it is seen as an NMI, in which case the kernel prints a "Dazed" message and continues running. The NMI is seen when the MCE panic has been disabled with the mce=off boot parameter.
There are new capabilities beginning with the 2.6.16 kernel. The code from the EDAC project was merged into the kernel as optional modules. The modules provide counters for correctable and uncorrectable errors, the ability to reset the counters through sysfs, a count of seconds since the last reset, and so on.
* short LWN EDAC writeup
* edac.txt from the 2.6.16 kernel docs
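As a rough illustration of the sysfs counters described above, the following Python sketch reads the correctable and uncorrectable error counts for each memory controller. It assumes the layout described in edac.txt, with one directory per controller under /sys/devices/system/edac/mc and attribute files named ce_count, ue_count, and seconds_since_reset; the function name read_edac_counters is hypothetical, and the paths may differ on a given kernel.

```python
from pathlib import Path

# Assumed location of the EDAC memory-controller directories (per edac.txt).
EDAC_MC_ROOT = Path("/sys/devices/system/edac/mc")

# Assumed per-controller attribute files, each holding a single integer.
COUNTER_FILES = ("ce_count", "ue_count", "seconds_since_reset")


def read_edac_counters(root=EDAC_MC_ROOT):
    """Return {controller: {attribute: int}} for each mc* directory under root."""
    results = {}
    root = Path(root)
    if not root.is_dir():
        # EDAC modules not loaded, or sysfs is laid out differently.
        return results
    for mc_dir in sorted(root.glob("mc[0-9]*")):
        values = {}
        for name in COUNTER_FILES:
            try:
                values[name] = int((mc_dir / name).read_text().strip())
            except (OSError, ValueError):
                # Attribute missing or unreadable on this kernel; skip it.
                continue
        results[mc_dir.name] = values
    return results


if __name__ == "__main__":
    for mc, values in read_edac_counters().items():
        print(mc, values)
```

Run periodically on each node, something like this would let an administrator notice a DIMM whose correctable error count is climbing before it produces an uncorrectable error. An empty result most likely means no EDAC module is loaded for the node's memory controller.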
The Linux EDAC modules support the following memory controllers:
* AMD 76x
* Intel e752x
* Intel e7xxx
* Intel 82860
* Intel D82875P
* Radisys 82600
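To check whether one of these controller drivers is actually loaded on a node, a small sketch like the following can scan /proc/modules. It assumes that EDAC module names contain the string "edac", which is a naming convention and not something stated above; the helper name edac_modules is hypothetical.

```python
def edac_modules(proc_modules_text):
    """Return the names of loaded modules whose name contains 'edac'.

    Assumption: EDAC core and controller drivers carry 'edac' in their
    module names. The module name is the first field on each line.
    """
    names = []
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields and "edac" in fields[0]:
            names.append(fields[0])
    return names


if __name__ == "__main__":
    try:
        with open("/proc/modules") as f:
            print(edac_modules(f.read()))
    except OSError:
        print("could not read /proc/modules")
```

An empty list suggests that either the kernel was built without the optional EDAC modules or no driver matches the node's memory controller.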