wathey at salk.edu
Wed Jul 2 11:01:25 EDT 2003
I need some advice about how to handle some ambiguous results from
memtest86. I also have some general questions about BIOS options
related to ECC memory.
First some background: I'm building a diskless cluster that will soon
grow to 100 dual-Athlon nodes. At present it has 10 diskless nodes
and a server. The boards are Gigabyte Technologies model GA7DPXDW-P,
and the CPUs are Athlon MP 2200+. In April I bought 69 1-gigabyte
ECC registered DDR modules from a vendor who had twice before sold me
reliable memory. This time, however, the memory was bad. Testing in
batches of 3 sticks per motherboard, nearly 100% failed memtest86,
and some machines crashed or would not even boot. The vendor replaced
all 69 sticks. Of this second batch, about 60% failed memtest86,
and the longer I tested, the more would fail. I again returned
them all. In both of these batches, the failures were numerous:
often thousands, hundreds of thousands, or even millions of errors.
The errors were usually multibit errors, where the "fail bits" were
things like 0f0f0f0f or ffffffff. The most commonly failing test
seemed to be test number 6, but others failed, too.
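As a rough illustration of why these count as multibit errors, the number
of bad bits in a "fail bits" mask can be counted directly. This is a
minimal Python sketch; the function name failing_bits is hypothetical, it
assumes the mask is a 32-bit hex value with one set bit per failing bit
position, and the third mask is a made-up single-bit case for comparison.

```python
def failing_bits(mask_hex: str) -> int:
    """Count the set bits in a hexadecimal 'fail bits' mask."""
    return bin(int(mask_hex, 16)).count("1")


# The first two masks are the ones quoted above; the third is an
# invented single-bit example, included only for contrast.
for mask in ("0f0f0f0f", "ffffffff", "00000400"):
    print(mask, failing_bits(mask))
```

A mask with more than one set bit means that several bits of the same
word were wrong at once.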
I am now testing the third batch of 69 sticks. I decided, more or less
arbitrarily, that I would consider them good if they passed 48 hours
of memtest86. Testing in batches of 3 per board, all but 6 groups of
3 sticks passed 48 hours of memtest86. I have been able to identify a
single failing stick in 2 of the 6 failed batches by testing 1 stick
per motherboard. I am still testing the others, 1 stick per board,
but so far none has failed.
So here is the problem: I have 4 batches of 3 sticks each that failed
memtest86 when tested together, 3 per board. The failures did
not occur on each pass of memtest's 16 tests. Instead the sticks would
pass all of the tests for several passes. In one case the failure
did not occur until after memtest86 had been running, without error,
for 42 hours on that machine. That particular failure was in a single
word in test 6. The worst of the 4 batches failed at 14 memory
locations. I have now been testing 9 of these 12 suspect sticks,
1 stick per motherboard, for several days. Several have now passed
more than 100 hours of memtest86 without error.
Can I trust them?
Should I keep them or return them?
If I return them, how long must I run memtest86 on the replacements
before I can trust those?
Can I trust the 55 or so sticks that passed 48 hours of memtest86 in
batches of 3?
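One way to put rough numbers on the "how long" question is to treat an
intermittent fault as producing errors at a constant average rate (a
Poisson process). That is a strong assumption and may not hold here,
since the failures appeared only with 3 sticks per board. The sketch
below is Python; the function names are hypothetical, and the rate of 1
error per 42 hours is simply read off the single late failure described
above.

```python
import math


def p_clean_run(rate_per_hour: float, hours: float) -> float:
    """Probability of seeing zero errors in `hours`, assuming Poisson errors."""
    return math.exp(-rate_per_hour * hours)


def rate_upper_bound(clean_hours: float, confidence: float = 0.95) -> float:
    """Upper confidence bound on the error rate after an error-free run."""
    return -math.log(1.0 - confidence) / clean_hours


# Assumed rate: one error per 42 hours, as in the late failure above.
rate = 1.0 / 42.0
for hours in (48, 100, 200):
    print(hours, "h clean with probability", round(p_clean_run(rate, hours), 3))

# What 100 error-free hours can and cannot rule out (errors per hour).
print("95% bound after 100 clean hours:", round(rate_upper_bound(100), 4))
```

Under these assumptions a stick that errs about once per 42 hours would
still pass a 48-hour run roughly a third of the time, so one clean run
of that length is weak evidence on its own.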
The vendor has been making a good-faith effort to solve the problem,
and has even agreed to refund my money for the whole purchase if I'm
not happy with it.
What would you do in this situation?
Those are the most urgent questions for which I need answers, but I
have a few others of a more general nature:
Is there a specific vendor or brand of memory that is much more
reliable than others? Since the above-described ordeal, I've heard
that Kingston has a good reputation. Anyone care to endorse or
refute that? Any other good brands/vendors you care to mention?
My understanding is that ECC can correct only single-bit errors, and
so would not help with the kind of multibit errors that have been
troubling me lately. But I have some basic questions on ECC that
you might be able to answer (I've asked the motherboard maker's tech
support, but to no avail!):
In the BIOS for my GA7DPXDW-P motherboards, there are these 4
alternatives for the SDRAM ECC Setting:
Correct + scrub
I'm pretty sure I understand what 'Disabled' does. Can anyone
explain to me what the others do, and how they differ? Also, if ECC
correction is enabled, does this slow down the machine in any way?
Is there any disadvantage to having ECC correction enabled?
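To illustrate the single-bit limit mentioned above, here is a toy
single-error-correct, double-error-detect (SECDED) code in Python. It is
only a sketch: it protects 4 data bits with 4 check bits (an extended
Hamming code), whereas ECC DIMMs commonly protect 64 data bits with 8
check bits, and it says nothing about how this particular chipset or
BIOS implements its ECC modes. The names encode and decode are
hypothetical.

```python
def encode(nibble: int) -> list:
    """Encode 4 data bits as an 8-bit extended Hamming codeword."""
    d = [(nibble >> i) & 1 for i in range(4)]
    c = [0] * 8                      # c[0] = overall parity, c[1..7] = Hamming
    c[3], c[5], c[6], c[7] = d       # data bits sit at non-power-of-two slots
    c[1] = c[3] ^ c[5] ^ c[7]        # parity over positions 1, 3, 5, 7
    c[2] = c[3] ^ c[6] ^ c[7]        # parity over positions 2, 3, 6, 7
    c[4] = c[5] ^ c[6] ^ c[7]        # parity over positions 4, 5, 6, 7
    c[0] = sum(c[1:]) % 2            # overall parity makes the total even
    return c


def decode(codeword: list) -> tuple:
    """Return (status, nibble); nibble is None if the word is uncorrectable."""
    c = list(codeword)
    syndrome = 0
    for pos in range(1, 8):
        if c[pos]:
            syndrome ^= pos          # XOR of set positions is 0 when valid
    odd_parity = sum(c) % 2 == 1
    if syndrome == 0 and not odd_parity:
        status = "ok"
    elif odd_parity:
        c[syndrome] ^= 1             # single-bit error at index `syndrome`
        status = "corrected"
    else:
        return "uncorrectable", None  # even parity, bad syndrome: 2 bits wrong
    nibble = c[3] | (c[5] << 1) | (c[6] << 2) | (c[7] << 3)
    return status, nibble


word = encode(0b1011)
one_flip = list(word)
one_flip[5] ^= 1
two_flips = list(one_flip)
two_flips[6] ^= 1
print(decode(word), decode(one_flip), decode(two_flips))
```

One flipped bit is repaired and two are detected but not repaired. With
three or more flipped bits a code like this can miscorrect or miss the
error entirely, which is why it would not be expected to help with fail
patterns such as 0f0f0f0f.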
Beowulf mailing list, Beowulf at beowulf.org