[Beowulf] Re: ECC Memory and Job Failures (Huw Lynes)
gerry.creager at tamu.edu
Thu Apr 23 17:16:45 EDT 2009
David Mathog wrote:
> Huw Lynes <lynesh at cardiff.ac.uk> wrote:
>> Apparently someone ran a large cluster job with both ECC and non-ECC
>> RAM. They consistently got the wrong answer when foregoing ECC.
> There were not very many details given. I would not rule out the
> possibility that the non-ECC memory was slightly faulty, and that the
> observed errors had nothing to do with gamma rays at all. A better test
> would have been to use the same ECC memory for both tests, and to turn
> ECC memory correction on and off in the BIOS.
Where's Jim Lux? I'm sure he has an opinion on this, too...
Cosmic ray hits are, if I recall correctly, an improbable event at the
earth's surface on the order of 1/1e13 sec (but I'm doing this from
memory and it may have taken a hit). In spaceborne applications,
however, the potential for random high energy particle hits is
significantly higher. And it's not just memory, although that tends to
be more susceptible. CPUs are also at risk. CMOS parts tend to
tolerate these events better than a lot of other technologies, such as
NMOS. A lot of old CPU and memory designs are still used for
spaceflight even today.
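
For a rough feel of what a rate like that implies, here is a
back-of-envelope sketch. It assumes the recalled figure means 1e-13
events per bit per second, which the number above does not actually
specify, and it models hits as independent (Poisson). The function
names and the example memory sizes are made up for illustration; swap
in a measured rate before trusting any output.

```python
import math

def expected_upsets(rate_per_bit_sec, n_bits, seconds):
    """Mean number of events for a constant per-bit rate (Poisson mean)."""
    return rate_per_bit_sec * n_bits * seconds

def prob_at_least_one(rate_per_bit_sec, n_bits, seconds):
    """P(at least one event) = 1 - exp(-mean) under a Poisson model."""
    return -math.expm1(-expected_upsets(rate_per_bit_sec, n_bits, seconds))

# ASSUMPTION: 1e-13 read as events per bit per second. The units are a
# guess, so this is a placeholder and not a measured rate.
RATE = 1e-13
GIB_BITS = 8 * 2**30  # bits in one GiB

if __name__ == "__main__":
    # Hypothetical configurations: (memory in GiB, run time in hours).
    for gib, hours in [(1, 1), (8, 24)]:
        bits = gib * GIB_BITS
        secs = hours * 3600
        mean = expected_upsets(RATE, bits, secs)
        p = prob_at_least_one(RATE, bits, secs)
        print(f"{gib} GiB for {hours} h: mean={mean:.3g}, P(>=1)={p:.3g}")
```

The point is only that the answer scales linearly with memory size and
run time, so the units attached to the rate matter a great deal.
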
I tend to buy the theory that there's something wrong with the non-ECC
components, rather than thinking there's a cosmic ray with your name on it.
Gerry Creager -- gerry.creager at tamu.edu
Texas Mesonet -- AATLT, Texas A&M University
Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.862.3983
Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843