[Beowulf] ECC RAM or not?

Jim Lux james.p.lux at jpl.nasa.gov
Tue Feb 17 23:37:00 EST 2004

> some kinds of codes are "rad hard", in the sense that if a failure gives
> you a possibly-wrong answer, you can just check the answer.

My practical experience with DRAM designs has been that bit errors are more
likely due to noise/design issues than to radiation-induced single event
upsets. Back in the '80s I worked on a Multibus system where we used to get
double-bit errors in 11/8 ECC several times a week.  Everyone just said
"well, that's why we have ECC" until I did some quick statistics on what the
ratio between single-bit (corrected but counted) and double-bit errors
should have been. Such high rates defied belief, and it turned out to be a
bus drive problem.
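
A rough sketch of that kind of sanity check, assuming bit errors are
independent and each bit of a word fails with the same small probability p
per read (both are assumptions for illustration, as are the function names
and the word width passed in; this is not the calculation actually done at
the time):

```python
from math import comb

def error_probs(n_bits, p):
    """Return (P(exactly 1 bad bit), P(exactly 2 bad bits)) for one word.

    Assumes independent bit errors with per-bit probability p (binomial model).
    """
    single = comb(n_bits, 1) * p * (1 - p) ** (n_bits - 1)
    double = comb(n_bits, 2) * p ** 2 * (1 - p) ** (n_bits - 2)
    return single, double

def single_to_double_ratio(n_bits, p):
    """Expected number of single-bit errors per double-bit error."""
    single, double = error_probs(n_bits, p)
    return single / double

# Hypothetical numbers: a 16-bit codeword and p = 1e-9 per bit per read.
ratio = single_to_double_ratio(16, 1e-9)
print(f"expected singles per double: {ratio:.3g}")
```

Under this model the ratio works out to 2(1-p)/((n-1)p), so for any
plausibly small p there should be an enormous number of corrected single-bit
errors for every double. Doubles showing up several times a week without a
correspondingly huge single-bit count points to correlated errors, i.e. a
design problem rather than random upsets.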

> that definition
> pretty much excludes traditional supercomputing, and certainly all
> physics-based simulations.  searching/optimization stuff might work well
> in that mode, though rechecking only catches false positives, doesn't
> recover from false negatives.  I suspect that doing ECC is cheaper than
> messing around with this kind of uncertainty, even for these specialized

There are a number of algorithms which have inherent self checking built in.
In the accounting business, this is why there's double entry, and/or
checksums. In the signal processing world, there are checks you can do on
things like FFTs, where total power in should equal total power out.
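
As an illustration of the FFT power check (Parseval's relation), here is a
minimal sketch. The small radix-2 FFT is included only to keep the example
self-contained; the function names and the tolerance are assumptions, not
anyone's production code:

```python
import cmath

def fft(x):
    """Recursive radix-2 FFT (unnormalized); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def power_check(x, X, rel_tol=1e-9):
    """True if total power in matches total power out.

    For an unnormalized DFT: sum |x[n]|^2 == (1/N) * sum |X[k]|^2.
    """
    p_in = sum(abs(v) ** 2 for v in x)
    p_out = sum(abs(v) ** 2 for v in X) / len(X)
    return abs(p_in - p_out) <= rel_tol * max(p_in, p_out)

x = [1, 2, 3, 4, 0, -1, -2, -3]
X = fft(x)
print(power_check(x, X))      # good result passes the check

bad = list(X)
bad[0] *= 2                   # simulate a corrupted output bin
print(power_check(x, bad))    # corrupted result fails the check
```

The check is cheap compared to the transform itself, but it is not
exhaustive: an error that leaves the total power unchanged (a sign flip in
one bin, say) gets through.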

> if you have 32 low-overhead nodes containing 20K-hour power supplies,
> you need to think about doing a replacement per month.
> if you have a 1M-hour disk in each of 1100 nodes, you shouldn't be shocked
> to get a couple failures a week.

Shades of replacing tubes in ENIAC or the Q-7A.

MIL-HDBK-217A is the "bible" on these sorts of computations.
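
The back-of-envelope version of the quoted estimates can be sketched as
follows, assuming a constant failure rate and independent units, so that
expected failures = units x hours / MTBF. The 30-day month and the function
name are my assumptions for illustration:

```python
HOURS_PER_WEEK = 7 * 24      # 168
HOURS_PER_MONTH = 30 * 24    # 720, assuming a 30-day month

def expected_failures(n_units, mtbf_hours, interval_hours):
    """Expected failure count over an interval, constant-rate model."""
    return n_units * interval_hours / mtbf_hours

# 32 nodes, 20K-hour power supplies, per month
psu_per_month = expected_failures(32, 20_000, HOURS_PER_MONTH)

# 1100 nodes, one 1M-hour disk each, per week
disk_per_week = expected_failures(1100, 1_000_000, HOURS_PER_WEEK)

print(f"power supplies: {psu_per_month:.2f} failures/month")
print(f"disks:          {disk_per_week:.2f} failures/week")
```

With these inputs the power supply figure comes out at about 1.15 per month,
in line with the quoted estimate. The disk figure comes out nearer 0.18 per
week, roughly one failure every five weeks; a couple of failures a week
across 1100 disks would correspond to an effective MTBF closer to 100K hours
than 1M.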
