memory nightmare

Jim Lux James.P.Lux at jpl.nasa.gov
Wed Jul 2 15:01:57 EDT 2003


At 08:01 AM 7/2/2003 -0700, Jack Wathey wrote:

>I need some advice about how to handle some ambiguous results from
>memtest86.  I also have some general questions about bios options
>related to ECC memory.

<big snip>

>My understanding is that ECC can correct only single-bit errors, and
>so would not help with the kind of multibit errors that have been
>troubling me lately.  But I have some basic questions on ECC that
>you might be able to answer (I've asked the motherboard maker's tech
>support, but to no avail!):


First off... you're correct that ECC (also called EDAC, for error detection
and correction) corrects single-bit errors and detects double-bit errors.
It's designed to deal with occasional bit flips, usually from radiation
(neutrons resulting from cosmic rays, background radiation from the
packaging, etc.), and really only addresses errors in the actual memory cells.

If you have errors in the data going to and from the memory, ECC does 
nothing, since the bus itself doesn't have EDAC.
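
To make the "correct one, detect two" behaviour concrete, here is a toy
sketch using an extended Hamming (8,4) code: 4 data bits, 3 Hamming parity
bits, plus one overall parity bit. This is only an illustration. Real ECC
memory typically protects 64 data bits with 8 check bits, and the function
names encode() and decode() are made up for this example, not taken from any
chipset or library.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword (list of bits)."""
    data = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8  # index 0 = overall parity, indices 1..7 = Hamming(7,4)
    code[3], code[5], code[6], code[7] = data
    for p in (1, 2, 4):
        # Parity bit p covers every position whose index has bit p set.
        code[p] = sum(code[i] for i in range(1, 8) if i & p) % 2
    code[0] = sum(code[1:]) % 2  # makes the total number of 1 bits even
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for i in range(1, 8):
        if code[i]:
            syndrome ^= i  # XOR of the indices of all set bits
    odd_parity = sum(code) % 2  # 1 means an odd number of bits flipped
    if syndrome == 0 and odd_parity == 0:
        status = "ok"
    elif odd_parity == 1:
        # Assume exactly one flip. The syndrome is the index of the bad bit
        # (0 means the overall parity bit itself was hit).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Even number of flips with a nonzero syndrome: detected, not fixable.
        return "uncorrectable", None
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble


word = encode(0b1011)
one_flip = list(word)
one_flip[5] ^= 1
two_flips = list(one_flip)
two_flips[2] ^= 1
print(decode(word), decode(one_flip), decode(two_flips))
```

Note that three or more flips in one word can be silently miscorrected by a
code like this, which is one reason heavy multibit error rates point at
something other than the memory cells.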

The probability of a single bit flip (or upset) is fairly low (I'd be
surprised at more than 1 a day), and the probability of multiple errors is
vanishingly small. One rate I have seen referenced is around 2E-12
upsets/bit/hr (remember that you won't see an upset in a bit if you don't
read it). There are some other statistics that show an upset occurs in a
typical PC-like computer with 256MB of RAM about once a month. Fermilab has
a system called ACPMAPS with 156 Gbit of memory, and they saw about 2.5
upsets/day (7E-13 upsets/bit/hr).
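
As a rough cross-check of those figures, the arithmetic can be sketched as
below. The assumptions are mine: "156 Gbit" is taken as 156e9 bits, 256MB as
256 * 2**20 bytes, and a month as 30 days. The helper names are made up for
this example.

```python
HOURS_PER_DAY = 24
DAYS_PER_MONTH = 30  # assumption: a 30-day month


def per_bit_hourly_rate(bits, observed_upsets_per_day):
    """Upsets/bit/hr implied by an observed daily upset count."""
    return observed_upsets_per_day / (bits * HOURS_PER_DAY)


def expected_upsets_per_day(bits, rate_per_bit_hr):
    """Expected upsets per day for a memory of the given size."""
    return bits * rate_per_bit_hr * HOURS_PER_DAY


# Fermilab ACPMAPS: 156 Gbit (assumed decimal giga), about 2.5 upsets/day.
acpmaps_bits = 156e9
acpmaps_rate = per_bit_hourly_rate(acpmaps_bits, 2.5)

# A PC with 256MB of RAM (assumed 256 * 2**20 bytes, 8 bits per byte).
pc_bits = 256 * 2**20 * 8
pc_month_low = expected_upsets_per_day(pc_bits, acpmaps_rate) * DAYS_PER_MONTH
pc_month_high = expected_upsets_per_day(pc_bits, 2e-12) * DAYS_PER_MONTH

print(f"ACPMAPS implied rate: {acpmaps_rate:.1e} upsets/bit/hr")
print(f"256MB PC at the ACPMAPS rate: {pc_month_low:.2f} upsets/month")
print(f"256MB PC at 2E-12 upsets/bit/hr: {pc_month_high:.2f} upsets/month")
```

Under those assumptions the ACPMAPS numbers work out to a little under 7E-13
upsets/bit/hr, and a 256MB machine lands at roughly one to three upsets a
month depending on which rate you pick, nowhere near enough to explain
frequent memtest86 failures.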

Lots of interesting information at 
http://www.boeing.com/assocproducts/radiationlab/publications/SEU_at_Ground_Level.pdf 
and, of course, the original papers from IBM (Ziegler, May and Woods).

On all systems I've worked on over the last 20 years that used ECC,
multiple-bit errors were always a timing or bus problem, i.e. a problem in
the electrical interfaces. If you're getting this many errors, it's
indicative of some fundamental misconfiguration or mismatch between what the
system wants to see and what your parts actually do.  Maybe wait states,
voltages, etc. are incorrectly set up?





James Lux, P.E.

Spacecraft Telecommunications Section
Jet Propulsion Laboratory, Mail Stop 161-213
4800 Oak Grove Drive
Pasadena CA 91109
tel: (818)354-2075
fax: (818)393-6875

_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf


