[Beowulf] ECC RAM or not?
james.p.lux at jpl.nasa.gov
Wed Feb 18 01:26:38 EST 2004
----- Original Message -----
From: "Mark Hahn" <hahn at physics.mcmaster.ca>
To: "Jim Lux" <James.P.Lux at jpl.nasa.gov>
Sent: Tuesday, February 17, 2004 9:36 PM
Subject: Re: [Beowulf] ECC RAM or not?
> > > some kinds of codes are "rad hard", in the sense that if a failure gives
> > > you a possibly-wrong answer, you can just check the answer.
> > My practical experience with DRAM designs has been that bit errors are more
> > likely due to noise/design issues than radiation induced single event
> > upsets.
> understood. then again, you're using deliberately selected rad-hard-ware,
Nope... that was off the shelf DRAMs in a commercial environment (in 1980ish
time frame, so they were none too dense DRAMs, either.. 256kB on a board I
think, many, many, pieces.. probably 64kbit parts..)
> I was mostly thinking about a talk I saw by the folks who care for ASCI-Q,
> which is in Los Alamos. they say that the altitude alone is worth a 14x
> increase in particle flux, and that this caused big problems for them with
> a particular register on the ES40 data path that was not ecc'ed.
Indeed.. ECC on memory is only part of the problem.. you really need ECC on
address and data lines for full coverage (or, more properly EDAC).. The
classic paper on altitude effects was done by folks at IBM, where they ran
boards in NY, in Denver, and underground in Denver. Good experimental
> > Back in the 80's I worked on a Multibus system where we used to get
> > double bit errors in 11/8 ecc several times a week. Everyone just said
> > "well, that's why we have ECC" until I did some quick statistics on what the
> > ratio between single bit (corrected but counted) and double bit errors
> > should have been. Such high rates defied belief, and it turned out to be a
> > bus drive problem.
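For what it's worth, that kind of sanity check can be sketched in a few lines.
This is only a rough model, not the calculation actually done back then: it
assumes every bit in a word flips independently with the same probability p
between reads (or scrubs), so the counts of flipped bits per word are binomial.
WORD_BITS and P_FLIP below are made-up illustrative values, not measurements
from that system.

```python
from math import comb

def error_probs(word_bits, p):
    """Return (P[exactly 1 flipped bit], P[exactly 2 flipped bits]) for one word.

    Assumes independent bit flips, each with probability p (binomial model).
    """
    single = comb(word_bits, 1) * p * (1 - p) ** (word_bits - 1)
    double = comb(word_bits, 2) * p ** 2 * (1 - p) ** (word_bits - 2)
    return single, double

def double_to_single_ratio(word_bits, p):
    """Expected number of double-bit errors per single-bit error (needs p > 0)."""
    single, double = error_probs(word_bits, p)
    return double / single

# Illustrative assumptions only: an 11-bit codeword and a per-bit flip
# probability of 1e-9 per read interval.
WORD_BITS = 11
P_FLIP = 1e-9

ratio = double_to_single_ratio(WORD_BITS, P_FLIP)
print(f"expected double-bit errors per single-bit error: {ratio:.3e}")
```

Under this model the ratio works out to (n-1)/2 * p/(1-p), so for any small p
double bit errors should be vastly rarer than single bit ones. If the observed
double bit count is anywhere near the single bit count, the errors are not
independent random hits, which points at something systematic like a bus or
timing problem.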
> makes sense. to be honest, I don't see many single-bit errors even,
> but today we've only < 200 GB ram online. inside a year, it'll probably
> be more like 2TB, so maybe things will get more exciting ;)
It's a very mixed bag, depending on what's causing the errors. If it's
radiation, smaller feature sizes mean that there's a smaller target to hit,
and the amount of energy transferred is less (of course, less energy is
stored in the memory cell, too)
> we're also pretty much at sealevel, with lots of building over us.
> reactor next door, though ;)
Type of particle, and its energy, has a huge effect on the SEU effects. I
would maintain, though, that run of the mill timing margin effects,
particularly over temperature, and EMI/EMC effects are probably a more
important source of bit hits in modern computers.