[Beowulf] Tyan S2882
bill at cse.ucdavis.edu
Tue Sep 26 21:58:28 EDT 2006
> We are currently deploying Tyan S2882 Dual Opteron Boards, and we have
> found the system to be quite unstable. After BIOS updates and kernel
Unstable when? When idle? Under heavy CPU load? Under heavy I/O?
During Install? Which OS/Dist/Kernel?
> changes we still get random kernel panics when under load.
What kind of load? How big is the power supply? What kind of CPU?
> Does anyone have these boards and has anyone found a solution? I have
> mailed other users of this board, who also reported random kernel panics
> and an unusual number of hardware problems.
How many are unreliable? 1 of 1? 10 of 10? 64 of 64?
> So far we have solved the
> - broken BIOS problem with an update to the most recent BIOS.
> - Discovered that some power supplies can produce problems
Power supplies do degrade over time, especially if overloaded.
> - FS corruption due to a firmware problem in a RAID hardware board
Indeed, hardware RAID problems seem shockingly common.
> - MCE chipkill errors (non-fatal) due to apparent bad RAM
Detected how? Did the new memory pass 24 hours of memtest86? Are you using
RAM certified as compatible with the S2882?
> To be solved:
> - random kernel panics that take out the logging even when all debug
> flags are set in the kernel, as it fails to sync the disc during the
> kernel panic.
You could log the console to a serial port, which does not depend on the
panicking machine syncing its disk.
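As a rough sketch of the serial approach: boot the unstable node with
something like console=ttyS0,115200n8 console=tty0 on the kernel command
line, run a null-modem cable to a second machine, and capture the output
there. The Python below is only an illustration of the capture side. The
function name capture_console, the device path /dev/ttyS0, and the file
name console.log are my assumptions, and it assumes the port speed has
already been set on the logging host (for example with stty).

```python
import io
import os
import time


def capture_console(stream, log, clock=time.time):
    """Copy console lines from `stream` to `log`, one timestamp per line.

    `stream` is any binary file-like object (hypothetically the serial
    device opened in binary mode); `log` is a text file-like object.
    Returns the number of lines written once `stream` reaches EOF.
    """
    count = 0
    # readline() returns b"" at EOF, which ends the loop.
    for raw in iter(stream.readline, b""):
        text = raw.decode("ascii", errors="replace").rstrip("\r\n")
        log.write(f"{clock():.3f} {text}\n")
        # Flush and fsync after every line so the last messages before a
        # panic are already on the logging host's disk.
        log.flush()
        try:
            os.fsync(log.fileno())
        except OSError:
            # In-memory logs (e.g. io.StringIO) have no file descriptor.
            pass
        count += 1
    return count


# Hypothetical use on the logging host (paths are assumptions):
#
#   with open("/dev/ttyS0", "rb", buffering=0) as port, \
#           open("console.log", "a") as log:
#       capture_console(port, log)
```

Syncing after every line is slow, but panic output is short and the point
is to keep the final lines. A terminal server or an existing tool such as
minicom with capture turned on would do the same job.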
I've got at least 32 of these, and they seem pretty reliable.