[Beowulf] PowerEdge SC 1435: Unexplained Crashes.
rpnabar at gmail.com
Fri Oct 17 11:37:17 EDT 2008
On Fri, Oct 17, 2008 at 10:22 AM, Nifty niftyompi Mitch
<niftyompi at niftyegg.com> wrote:
> Check the baseboard management controller log (Ctrl+E).
> Tell us what software distribution you are running and any changes that might have
> been made (no matter how small). What is the default run level (is X11 active/ not active).
> Are power saving options enabled in the BIOS?
Distro: CentOS 5.2.
Linux node03 2.6.18-92.el5 #1 SMP Tue Jun 10 18:51:06 EDT 2008 x86_64
x86_64 x86_64 GNU/Linux
No changes made to standard kernel. X11 not active. Power saving not enabled.
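
For anyone who wants to collect the same answers from every node, here is a minimal sketch, not something I actually ran on these boxes. It assumes a SysV-style /etc/inittab (as on CentOS 5) and a reasonably recent Python 3; the function names default_runlevel and x11_expected are made up for illustration. On Red Hat style systems runlevel 5 is the one that starts the graphical login, so anything else normally means X11 is not active.

```python
import platform
import re


def default_runlevel(inittab_text):
    """Return the default runlevel from SysV inittab text, or None if absent."""
    for line in inittab_text.splitlines():
        line = line.strip()
        if line.startswith("#"):
            continue  # skip comment lines
        # The entry looks like "id:3:initdefault:"
        match = re.match(r"^[^:]*:(\d):initdefault:", line)
        if match:
            return int(match.group(1))
    return None


def x11_expected(runlevel):
    """Runlevel 5 starts the graphical login on Red Hat style systems."""
    return runlevel == 5


if __name__ == "__main__":
    # Kernel and architecture, roughly what "uname -a" reports
    print(platform.uname())
    try:
        with open("/etc/inittab") as handle:
            level = default_runlevel(handle.read())
    except OSError:
        level = None  # no inittab on this system
    print("default runlevel:", level, "X11 expected:", x11_expected(level))
```

Run it on each node and compare the output across the good and the bad ones.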
> Also what hardware monitor software are you running. I have seen system admins add
> their own package to systems only to find that RHEL has an equivalent package
> that uses different device drivers for the same hardware with impossible to diagnose
> results. Custom kernel?
I am not sure what you mean by "hardware monitor software". I do not
recall installing anything special.
> Disable cpuspeed, hardware monitor and hardware control software to see if stability changes.
There are a bunch of Dell utilities that come up at boot time: BMC,
RAID, Broadcom PXE, remote management controllers. Do you want me to
disable those?
Stability has already changed after I swapped the motherboard and CPU.
No more dead nodes in over 2 weeks now (yay!). But I just want to make
sure this won't be a recurring problem with these SC1435s before we
go in for our next expansion.
> What additional hardware is in the chassis?
None other than what came with the original Dell units. These are only
2 months old now. They have dual NICs, disks, and no CD-ROMs. They are
linked to a Dell KVM via a SIP module. No mix-n-matching of hardware;
it was a single monolithic Dell order.
> The "poweredge indicator turning orange" tells me that the problem was detected by the
> system and there should be a hint in the log. The orange state is sticky and
> needs to be cleared....
Funny. It wasn't sticky for me. When I rebooted the orange light
cleared. I did not need to reset it via the BIOS. Unfortunately the SC
series does not have the tiny LCD for an error display.