[Beowulf] Repeated Dell SC1435 crash / hang. How to get the vendor to resolve the issue when 20% of the servers fail in first year?

Mark Hahn hahn at mcmaster.ca
Mon Apr 6 13:37:11 EDT 2009

> I put these machines into production in Aug '08. Within a month we had
> the first machine go bad. They hang with an amber LED and the

what's the term of the warranty?

> logging-module clearly logs an error of the sort: "Voltage sensor
> (VCORE) critical error. State asserted CPU2". Machine needs a
> power-cycle physically from back-plane to restart

well, I think it's worth asking whether you're sure your power feed
is in good shape.

> Do others face similar vendor issues? If 6 out of 23 machines go bad
> within 8 months of an order can I expect the vendor to exchange the
> rest too?

IMO, no.  not without some indication that the fault is reproducible
and is actually theirs...

> And a single bad machine causes larger problems since it usually
> results in disrupting jobs that run spanning across a bunch of nodes
> too.

well, if you bought it as a cluster, not just some nodes,
then you might have a case that the cluster is not working.
the trouble is that a fault that's hard to replicate permits finger-pointing.

> Just wanting to hear more about how I can best resolve this issue. For
> our future purchases would changing vendors help? Is there any trend

buying an extended warranty might help.  buying a shrink-wrapped cluster
might help too.

> behind the quality of services from different vendors? I have only
> been exposed to Dell and its frustrating customer-service so far; are
> HP / IBM or any others better or worse or uncorrelated? Of course, I

my organization has been an HP shop, more or less, since inception in 2001,
for reasons I won't go into.  I believe they've done well by us - I could 
criticize prices, some hardware design issues, etc, but they're quite 
responsible and responsive to problems.