[Beowulf] ECC RAM or not?

Mark Hahn hahn at physics.mcmaster.ca
Tue Feb 17 18:20:12 EST 2004

> For a low-cost cluster, would you insist on ECC RAM or not, and why?

how low-cost, and what kind of code?

technically, the chance of seeing dram corruption depends on how much
ram you have, and how much you use it (as well as environmental factors,
such as altitude, of course!)  for a sufficiently low-cost cluster,
you'd expect to have relatively little ram, and little CPU power to churn it,
and therefore a low rate of bit-flips.  otoh, you can bet that the recent 
ECC upgrade of the VT cluster had a significant real cost (probably eaten
by vendors for PR reasons...)

some kinds of codes are "rad hard", in the sense that if a failure gives
you a possibly-wrong answer, you can just check the answer.  that definition
pretty much excludes traditional supercomputing, and certainly all
physics-based simulations.  searching/optimization stuff might work well
in that mode, though rechecking only catches false positives; it doesn't 
recover from false negatives.  I suspect that doing ECC is cheaper than 
messing around with this kind of uncertainty, even for these specialized codes.
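
to make the false-positive/false-negative asymmetry concrete, here is a minimal python sketch.  everything in it is hypothetical: `is_solution` stands in for whatever test the real code applies, and `flaky` simulates a bit-flip by inverting the answer for a chosen set of inputs.

```python
def is_solution(x):
    # Hypothetical stand-in for an expensive test: "solutions" are multiples of 7.
    return x % 7 == 0


def flaky(x, corrupted):
    # Simulated memory corruption: the answer is inverted for inputs in `corrupted`.
    result = is_solution(x)
    return (not result) if x in corrupted else result


def search_with_recheck(candidates, corrupted):
    # First pass runs on the "unreliable" machine.
    hits = [x for x in candidates if flaky(x, corrupted)]
    # Second pass rechecks only the reported hits with a clean evaluation.
    return [x for x in hits if is_solution(x)]


# 5 becomes a false positive, 14 becomes a false negative.
found = search_with_recheck(range(1, 30), corrupted={5, 14})
print(found)
```

the recheck throws out 5, but 14 never shows up as a hit, so nothing ever rechecks it and it is silently lost.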

> My inclination would be to always use ECC for anything, but it looks
> as if there is no such thing as an inexpensive motherboard which also
> supports ECC RAM.  Either you can have a cheap motherboard (well under
> $100) with no ECC, or a pricey (well over $100) motherboard with ECC.

well, you're really pointing out the difference between desktop and 
workstation/server markets.  for instance, there's not much physical
difference between the i875 and i865 chipsets, but the former shows 
up in $200 boards that need a video card, and the latter in $100 ones
that have integrated video.

> Am I mistaken about this, or are there really no exceptions to this
> seeming "ECC motherboards are always expensive" rule?

it's a marketing/market-driven phenomenon.

it's useful to work out the risks when you make this kind of decision.

if you have 32 low-overhead nodes containing 20K-hour power supplies, you'll
need to think about doing a replacement per month.

if you have a 1M-hour disk in each of 1100 nodes, you shouldn't be shocked
to get a couple failures a week.
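
the arithmetic behind those two estimates can be sketched in python as below.  it assumes independent units with a constant failure rate, so a fleet of N units fails about N times as often as one unit; the function name is just for illustration.  note that the rated 1M-hour figure alone gives roughly one disk failure every five weeks for 1100 nodes, so seeing a couple a week would mean field failure rates well above the rated MTBF.

```python
HOURS_PER_DAY = 24.0


def days_between_failures(mtbf_hours, units):
    # Assumption: independent units, constant failure rate, so the
    # fleet-wide mean time between failures is mtbf_hours / units.
    return mtbf_hours / units / HOURS_PER_DAY


psu_days = days_between_failures(20_000, 32)        # 32 power supplies
disk_days = days_between_failures(1_000_000, 1100)  # 1100 disks

print(f"power supplies: one failure every {psu_days:.0f} days or so")
print(f"disks: one failure every {disk_days:.0f} days or so")
```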

if 1100 nodes with 4G but no ECC see two undetected corruptions a day,
then 32 nodes with 1G will go a couple months between events...
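
that scaling can be sketched the same way.  the assumption (mine, for illustration) is that the event rate is simply proportional to total installed ram; it ignores how hard the memory is used and environmental factors like altitude, both of which matter as noted earlier.

```python
def days_between_corruptions(ref_events_per_day, ref_nodes, ref_gb_per_node,
                             nodes, gb_per_node):
    # Assumption: corruption rate scales linearly with total GB of RAM.
    ref_total_gb = ref_nodes * ref_gb_per_node
    total_gb = nodes * gb_per_node
    events_per_day = ref_events_per_day * total_gb / ref_total_gb
    return 1.0 / events_per_day


# Reference: 1100 nodes x 4G seeing 2 events/day; target: 32 nodes x 1G.
small_cluster_days = days_between_corruptions(2, 1100, 4, 32, 1)
print(f"about {small_cluster_days:.0f} days between events")
```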

regards, mark hahn.
