[Beowulf] HPC fault tolerance using virtualization

John Hearns hearnsj at googlemail.com
Tue Jun 16 05:02:11 EDT 2009


2009/6/16 Kilian CAVALOTTI <kilian.cavalotti.work at gmail.com>

> I may be missing something major here, but if there's bad hardware, chances
> are the job has already failed from it, right? Would it be a bad disk (and
> the OS would only notice a bad disk while trying to write on it, likely
> asked to do so by the job), or bad memory, or bad CPU, or faulty PSU.
> Anything hardware losing bits mainly manifests itself in software errors.
> There is very little chance to spot a bad DIMM until something (like a job)
> tries to write to it.


What you say is very true.

However, you could look for correctable ECC errors, and for disks run a
smartctl test to see whether a disk is showing symptoms which suggest it
might fail in future.
Or maybe look at the error rates on your Ethernet or InfiniBand interface -
you might want to take that node out until it can be investigated (read:
reseating the cable!)
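
As a rough illustration of those checks, here is a minimal Python sketch. It
assumes a Linux node where the EDAC driver exposes correctable-error counts
under /sys/devices/system/edac/mc/, where interface counters live under
/sys/class/net/<iface>/statistics/, and where smartctl is installed. The
function names, the sysfs_root parameter and the thresholds are all made up
for the example, and it only covers an Ethernet-style interface, so treat it
as a starting point rather than a finished health check.

```python
import glob
import os
import subprocess


def read_counter(path):
    """Return the integer stored in a sysfs counter file, or 0 if unreadable."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return 0


def correctable_ecc_errors(sysfs_root="/sys"):
    """Sum the correctable-error counts of all EDAC memory controllers."""
    pattern = os.path.join(sysfs_root, "devices/system/edac/mc/mc*/ce_count")
    return sum(read_counter(p) for p in glob.glob(pattern))


def interface_errors(iface, sysfs_root="/sys"):
    """Sum the receive and transmit error counters of one network interface."""
    stats = os.path.join(sysfs_root, "class/net", iface, "statistics")
    return sum(read_counter(os.path.join(stats, name))
               for name in ("rx_errors", "tx_errors"))


def smart_health_failing(device):
    """Run `smartctl -H`; bit 3 of its exit status means the disk reports failing."""
    result = subprocess.run(["smartctl", "-H", device], capture_output=True)
    return bool(result.returncode & 0x08)


def node_is_suspect(ecc_errors, net_errors, disk_failing,
                    ecc_limit=0, net_limit=100):
    """Hypothetical policy: flag the node if any check exceeds its limit."""
    return disk_failing or ecc_errors > ecc_limit or net_errors > net_limit
```

A scheduler prologue or a cron job could call these and mark the node offline
when node_is_suspect() returns True. The limits are arbitrary here; in
practice you would probably want to look at how fast the counters grow
rather than at their absolute values.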
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

