[Beowulf] Repeated Dell SC1435 crash / hang. How to get the vendor to resolve the issue when 20% of the servers fail in first year?

Rahul Nabar rpnabar at gmail.com
Tue Aug 11 18:31:08 EDT 2009

On Thu, Apr 9, 2009 at 11:35 AM, Douglas J. Trainor <trainor at transborder.net> wrote:
> Rahul,
> I think Greg et al. are correct.  Does your SC1435 have a Delta Electronics
> switching power supply?  I bet you have a 600 watt Delta.
> Intel recently had problems with outsourced 350 watt "FHJ350WPS" switching
> power supplies that apparently affected 5% of some server lines.  These were
> loading imbalance problems between the 3.3 volt and 12 volt lines.  The
> affected power supplies had a minimum loading requirement that was not met.
>  The over-voltage protection circuit would kick in on the 3.3V line.
>  However, in these cases, the Intel machines would not reboot.  Intel is
> modifying the 3.3 volt minimum loading from 1.2 amps to 0.2 amps to fix the
> problem.

A while ago I posted about the crashing SC1435s that I had. I
received lots of good suggestions from this group. Thanks, all!

A lot of persistence with the vendor succeeded in making their
Engineering team do long-run tests on one of our captured machines. It
needed to be tested for over one month before they finally
replicated the failure. Whew! (In the past they had aborted tests well
before that point.)

They won't give me many internal details, but apparently it is caused
by an "hardware issue more likely caused certain motherboards with
Opterons" [sic].

So, thanks again, and it does seem that we finally got down to the cause
of this irritating problem! I am posting this in case it helps any
other SC1435 admins in a similar boat!



