[Beowulf] using watchdog timers to reboot a hung system automagically: Good idea or bad?

Rahul Nabar rpnabar at gmail.com
Fri Oct 23 14:01:05 EDT 2009

On Fri, Oct 23, 2009 at 12:35 PM, Mark Hahn <hahn at mcmaster.ca> wrote:
>> My philosophy though would be to leave a machine down till the cause of
>> the crash is established.
> absolutely.  this is not an obvious principle to some people, though:
> it depends on whether your model of failures involves luck or causation ;)
> and having decent tools (IPMI SEL for finding UC ECCs/overheating/etc,
> console logging for panics) is what lets you rule out bad juju...

Other factors that sometimes make me violate this principle of "always
establish a crash cause":

1. Manpower to debug. Let's say the error has a cause but is
relatively infrequent. I might achieve higher uptime with a simple
reboot until I get the time to fight this particular fire. People are
happier to have a crashed node humming away again as soon as possible
rather than waiting for me to find the time to look at it and reach a
definite diagnosis. Forensics takes time.

2. Some errors are hardware-precipitated. Aging, out-of-warranty
hardware can sometimes need such a reboot compromise for one-off
random errors.

Maybe all the "nice" clusters out there never have this issue but for
me it is fairly common. Just confessing.
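
To put rough numbers on the uptime trade-off in point 1, here is a
minimal sketch in Python. Every figure in it is a hypothetical
assumption (one crash a month, 15 minutes to reboot, 48 hours until
someone has time to diagnose), not a measurement from any real
cluster, and the function name availability is made up for
illustration. It also assumes the crash rate is the same either way,
which is exactly the "luck or causation" question raised above.

```python
def availability(mtbf_hours, downtime_hours):
    """Steady-state fraction of time a node is up."""
    return mtbf_hours / (mtbf_hours + downtime_hours)


# Hypothetical numbers, chosen only for illustration.
MTBF_HOURS = 30 * 24        # assume one crash a month on average
REBOOT_HOURS = 0.25         # assume a reboot takes about 15 minutes
DIAGNOSIS_WAIT_HOURS = 48   # assume the node waits 2 days for a diagnosis

reboot_now = availability(MTBF_HOURS, REBOOT_HOURS)
wait_for_diagnosis = availability(MTBF_HOURS, DIAGNOSIS_WAIT_HOURS)

print(f"reboot immediately:         {reboot_now:.4%}")
print(f"leave down until diagnosed: {wait_for_diagnosis:.4%}")
```

If the fault is causal and gets more frequent over time, the first
number shrinks with it, so the quick reboot only wins while the error
stays rare.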

