[Beowulf] ECC Scrub, which setting?

Jim Lux james.p.lux at jpl.nasa.gov
Tue May 20 00:51:22 EDT 2008


Quoting Mark Hahn <hahn at mcmaster.ca>, on Mon 19 May 2008 08:47:46 PM PDT:

>>> It is currently set to
>>> Basic, which scrubs every 5.24 ms.
>>
>> You'll have to look in the manual to find out what that means -- it's
>> probably "do a small amount of scrubbing every 5.24 ms". And you have
>
> I expect it's the interval between cacheline-sized (64B) scrubs. as
> such, I think it's much too low (4G ram in 98 hours!)
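
The 98-hour figure can be checked with a few lines of arithmetic. This is only a sketch: it assumes, as Mark does, that "Basic" means one 64-byte cacheline is scrubbed every 5.24 ms, and it takes "4G" to mean 4 GiB. The function name is made up for illustration.

```python
def full_sweep_hours(ram_bytes, line_bytes=64, interval_s=5.24e-3):
    """Hours needed to scrub all of RAM once, one cacheline per interval.

    Assumes (not confirmed by any manual) that exactly one line of
    line_bytes is scrubbed every interval_s seconds.
    """
    lines = ram_bytes // line_bytes      # number of cachelines to visit
    return lines * interval_s / 3600.0   # seconds -> hours


if __name__ == "__main__":
    hours = full_sweep_hours(4 * 2**30)  # 4 GiB of RAM
    print(f"full sweep: {hours:.1f} hours")
```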


too low, based on what assumption for upset rate?

If the rate is, say, 1E-13 upset/bit/day, and you've got 1 Gbyte  
(roughly 1E10 bits), you're looking at 1E-3 upsets/day.  Since the ECC  
will correct the error, what you're really fighting with the scrubbing  
is the probability of a *double* error in the same word.  That depends on
the error statistics, i.e. whether a single event can flip multiple bits
in the same word (unlikely with most memory layout schemes, which spread
a word's bits across the geometry, but you never know...)
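
As a rough check of that arithmetic, here is a minimal sketch. The upset rate is the illustrative 1E-13 upset/bit/day from above, not a measured value, and the function name is hypothetical. Using the exact 8E9 bits in 1 Gbyte instead of the rounded 1E10 gives a slightly smaller number of the same order.

```python
def upsets_per_day(rate_per_bit_day, ram_bytes):
    """Expected number of single-bit upsets per day across all of RAM.

    Assumes every bit is upset independently at the same constant rate.
    """
    bits = ram_bytes * 8
    return rate_per_bit_day * bits


if __name__ == "__main__":
    # 1 Gbyte at the assumed 1E-13 upset/bit/day
    print(upsets_per_day(1e-13, 1e9))
```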

And if you DO get a double error, the ECC will detect it, and you
can halt or take corrective measures (e.g. throw away that work
package's output, and restart from a checkpoint, etc.)


Even if the rate is much higher.. say 1E-12 upset/bit/hour.. about 240
times higher than the 1E-13 upset/bit/day I used above.  And say you've got
4 Gbyte of RAM.. now you're looking at roughly one (fully corrected) upset
per day. The probability of an undetected error is still quite low
(requiring at least 3 errors in the same word), and the probability of a
double bit error causing an abort (within the 100 or so hours you
calculated for the scrub) is probably low enough that it wouldn't
materially affect your computation rate.  And this assumes that your OS
doesn't autoscrub on a detected Single Bit Error, perhaps because the
hardware doesn't support it.
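
To put a rough number on "probably low enough", the sketch below estimates how many ECC words would collect two or more upsets during one scrub pass. Everything beyond the 1E-12 upset/bit/hour rate, the 4 Gbyte size and the roughly 98-hour sweep is an assumption made for illustration: a 72-bit SECDED word (64 data bits plus 8 check bits), upsets that are independent and uniformly spread over bits (exactly the statistics questioned above), and function names invented here.

```python
def upsets_per_day(rate_per_bit_hour, ram_bytes):
    """Expected single-bit upsets per day over the data bits of RAM."""
    return rate_per_bit_hour * 24 * ram_bytes * 8


def expected_double_error_words(rate_per_bit_hour, ram_bytes, window_hours,
                                data_bits=64, word_bits=72):
    """Expected number of ECC words hit two or more times in one window.

    Assumes independent, uniformly distributed upsets (Poisson counts per
    word) and a hypothetical 72-bit word holding 64 data bits.
    """
    words = ram_bytes * 8 / data_bits
    # Mean number of upsets landing in one word during the window.
    mu = rate_per_bit_hour * word_bits * window_hours
    # For very small mu, P(two or more upsets) is approximately mu^2 / 2.
    p_two_or_more = mu ** 2 / 2
    return words * p_two_or_more


if __name__ == "__main__":
    ram = 4 * 2**30  # 4 GiB
    print("upsets/day:", upsets_per_day(1e-12, ram))
    print("double-error words per 98 h sweep:",
          expected_double_error_words(1e-12, ram, 98.0))
```

Under these assumptions the expected count of double-error words per sweep comes out many orders of magnitude below one, so the result is dominated by whether upsets really are independent, not by the scrub interval.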


OTOH, if the ECC is protecting you from a lousy mobo design with  
timing glitches and crosstalk between traces manifesting as errors...



Jim
