memory nightmare
Jim Lux
James.P.Lux at jpl.nasa.gov
Wed Jul 2 15:01:57 EDT 2003
At 08:01 AM 7/2/2003 -0700, Jack Wathey wrote:
>I need some advice about how to handle some ambiguous results from
>memtest86. I also have some general questions about bios options
>related to ECC memory.
<big snip>
>My understanding is that ECC can correct only single-bit errors, and
>so would not help with the kind of multibit errors that have been
>troubling me lately. But I have some basic questions on ECC that
>you might be able to answer (I've asked the motherboard maker's tech
>support, but to no avail!):
First off... you're correct that ECC (also called EDAC, for error
detection and correction) corrects single-bit errors and detects
double-bit errors. It's designed to deal with occasional bit flips,
usually from radiation (neutrons resulting from cosmic rays, background
radiation from the packaging, etc.), and really only addresses errors in
the actual memory cells. If you have errors in the data going to and from
the memory, ECC does nothing, since the bus itself doesn't have EDAC.
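
To make "corrects one, detects two" concrete, here is a toy sketch of a
SECDED code: an extended Hamming code that protects 4 data bits with 4
check bits. This is an illustration only, not how any particular chipset
does it; real ECC DIMMs typically use a wider (72,64) code, and the
function names encode and decode are made up for the example.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming (SECDED) codeword.

    Index 0 holds the overall parity bit; indices 1..7 are the classic
    Hamming(7,4) positions (check bits at 1, 2, 4; data at 3, 5, 6, 7).
    """
    code = [0] * 8
    code[3] = (nibble >> 0) & 1
    code[5] = (nibble >> 1) & 1
    code[6] = (nibble >> 2) & 1
    code[7] = (nibble >> 3) & 1
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2  # make the total number of 1s even
    return code


def decode(code):
    """Return (nibble, status); nibble is None for an uncorrectable word."""
    code = list(code)
    # The syndrome is the XOR of the positions of all set bits. It is 0
    # for a valid codeword and equals the flipped position after a
    # single-bit error in positions 1..7.
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    parity_odd = sum(code) % 2 == 1

    if syndrome == 0 and not parity_odd:
        status = "ok"
    elif parity_odd:
        # Odd number of flips: treat as a single error and repair it.
        # A syndrome of 0 here means the overall parity bit itself flipped.
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Nonzero syndrome with even overall parity: two bits flipped.
        return None, "double-bit error detected"

    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status


word = encode(0b1011)
word[6] ^= 1                 # one upset: repaired
print(decode(word))
word[2] ^= 1                 # a second upset in the same word: only flagged
print(decode(word))
```

The second case is the one that matters for this thread: once two bits
in the same word are wrong, the code can only report the error, not
repair it.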
The probability of a single bit flip (or upset) is fairly low (I'd be
surprised at more than 1 a day), and the probability of multiple errors is
vanishingly small. One rate I have seen referenced is around 2E-12
upsets/bit/hr (remember that you won't see an upset in a bit if you don't
read it). There are some other statistics that show an upset occurs in a
typical PC-like computer with 256MB of RAM about once a month. Fermilab has
a system called ACPMAPS with 156 Gbit of memory, and they saw about 2.5
upsets/day (7E-13 upsets/bit/hr).
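
As a sanity check on those figures, the arithmetic can be sketched in a
few lines. The assumptions here are mine: 156 Gbit is taken as 156 * 10^9
bits, 256MB as 256 * 2^20 bytes, and the upset rate is treated as uniform
across all bits.

```python
HOURS_PER_DAY = 24.0

ACPMAPS_BITS = 156e9           # 156 Gbit, assumed to mean 156 * 10^9 bits
PC_BITS = 256 * 2**20 * 8      # 256MB, assumed to mean 256 * 2^20 bytes


def upsets_per_day(bits, per_bit_hr):
    """Expected upsets per day for a memory of `bits` bits."""
    return bits * per_bit_hr * HOURS_PER_DAY


def implied_rate(bits, observed_per_day):
    """Per-bit, per-hour upset rate implied by an observed daily count."""
    return observed_per_day / HOURS_PER_DAY / bits


# Rate implied by the Fermilab observation of about 2.5 upsets/day.
fermilab_rate = implied_rate(ACPMAPS_BITS, 2.5)
print(f"ACPMAPS implied rate: {fermilab_rate:.1e} upsets/bit/hr")

# What each quoted rate would mean for a 256MB machine.
for label, rate in (("2E-12 figure", 2e-12), ("Fermilab rate", fermilab_rate)):
    per_day = upsets_per_day(PC_BITS, rate)
    print(f"{label}: {per_day:.3f} upsets/day, "
          f"one every {1.0 / per_day:.0f} days")
```

With these assumptions the Fermilab rate works out to roughly one upset
a month on a 256MB machine, and the 2E-12 figure to one every week or
two. Either way it is far below the error counts being described here.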
Lots of interesting information at
http://www.boeing.com/assocproducts/radiationlab/publications/SEU_at_Ground_Level.pdf
and, of course, the original papers by Ziegler (IBM) and by May and Woods
(Intel).
On all systems I've worked on over the last 20 years that used ECC,
multiple-bit errors were always a timing or bus problem, i.e. the
electrical interfaces. If you're getting this many errors, it indicates
some fundamental misconfiguration or mismatch between what the system
wants to see and what your parts actually do. Maybe wait states, voltages,
etc. are incorrectly set up?
James Lux, P.E.
Spacecraft Telecommunications Section
Jet Propulsion Laboratory, Mail Stop 161-213
4800 Oak Grove Drive
Pasadena CA 91109
tel: (818)354-2075
fax: (818)393-6875