[Beowulf] Repeated Dell SC1435 crash / hang. How to get the vendor to resolve the issue when 20% of the servers fail in first year?
rpnabar at gmail.com
Mon Apr 6 15:35:10 EDT 2009
On Mon, Apr 6, 2009 at 1:32 PM, Frank Gruellich
<frank.gruellich at navteq.com> wrote:
> IMHO the SC1435 is some kind of low-cost metal from DELL. I would not use
> it if I wanted a reliable system. Especially in HPC, where one failed
> system ruins your whole (maybe long-running) job.
Thanks for the comments, Frank. I did not realize that the SC1435
wasn't suitable for HPC. I know it is one of the lower-end systems,
without remote management or hot-swappable hardware etc. (we don't
really need the frills), but I was under the impression that this model
is fairly common in other HPC installations. Maybe we were wrong, in
> The DELL support is a bit tricky. We have Silver or Gold support for
> most systems; I don't know how they work for lower levels. I can't
> complain about Gold. For Silver they always try to make us do stuff
> like cross-testing memory, CPU or other things. (The most interesting
> request is to do a BIOS update to cure an (obviously) memory problem.
> The machine ran fine for 2 years with the old BIOS -- memory combination
> and suddenly it complains about it?) While I really like to play such
> hardware games, I just don't have the time for it. If you keep refusing
> these requests, eventually they give up and send a technician to replace
> different pieces of hardware.
I ought to check if we are "Gold" or "Silver" or neither. Yes, the BIOS
update gig I am familiar with. I can almost quote their debug checklist
from memory. They made me confirm and update BIOSes too. It was
funny, especially since it hadn't even been a month since we bought
them, but the tech insisted our BIOS was *not* up-to-date back then. We
fixed it, but I always wonder why they do not just ship the machines
with up-to-date versions of the BIOS!
> We use CentOS for most installations and DELL support never complained
> about it. And IMHO the OS should not be able to cause an error detected by
> the management board.
Exactly my opinion. It clearly seems to be a hardware-level fault, and
the OS angle seems mostly smoke and mirrors to me. If it were a simple
software crash, I cannot explain why the system will not reboot when
the reboot button is pressed.
> I have dset reports in place, before calling support, because they
> always request them. That speeds up chit-chat a bit.
Yes, dset and sosreports seem standard requests.
> That's another problem: IMHO your university should have a dedicated guy
> taking care of the computer systems, someone who has the time to deal with
> DELL support and so on. 23 machines don't make a full-time job, but
> maybe someone who's already taking care of some other Linux installation.
> It's not a good idea to have just some grad student doing that
> job part-time (no offense). I know that reality looks bad.
Ah well, one does what one needs to! :) These are dedicated research
machines for our computational chemistry group so they will be running
code that eventually (hopefully!) puts results into my PhD thesis! :)
Most parts of system administration are fun except maybe having to
deal with stubborn vendors!