[Beowulf] using watchdog timers to reboot a hung system automagically: Good idea or bad?

Rahul Nabar rpnabar at gmail.com
Thu Oct 22 20:56:16 EDT 2009

I wanted to get some opinions on whether watchdog timers are a good idea
or not. I came across watchdogs again while reading through my IPMI
manual. In principle it sounds neat: if the system hangs, get it
to reboot automatically after, say, 5 minutes. But in practice, maybe
it is a terrible idea.

Of course, one might say a well-configured HPC compute node
shouldn't be getting to a hung point anyway, but in practice I see a
few nodes every month that can be resurrected by a simple reboot.
Admittedly these nodes are quite senile.

The danger, it seems to me: what if a node kept crashing (due to, say, a
bad HDD or something)? Then a watchdog would merely keep rebooting
this node a hundred times. Not such a good thing.

Have you guys used watchdog timers? Maybe there is a way to build a
circuit breaker around the principle, so that if a node reboots
automatically more than 3 times the watchdog gives up?
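
To make the circuit-breaker idea concrete, here is a rough sketch of the
decision logic only. It assumes the node keeps a list of timestamps of its
watchdog-triggered reboots somewhere persistent and checks it at boot before
arming the watchdog. The function names, the 24-hour window, and the idea of
a persistent reboot log are my own assumptions, not something IPMI provides.

```python
# Sketch of a reboot "circuit breaker". All names here are hypothetical.
MAX_REBOOTS = 3               # give up after more than 3 automatic reboots
WINDOW_SECONDS = 24 * 3600    # assumption: only count reboots from the last day


def recent_reboots(timestamps, now, window=WINDOW_SECONDS):
    """Return the reboot timestamps that fall inside the look-back window."""
    return [t for t in timestamps if 0 <= now - t <= window]


def should_arm_watchdog(timestamps, now,
                        max_reboots=MAX_REBOOTS, window=WINDOW_SECONDS):
    """True while the node has not exceeded the allowed number of reboots.

    Once more than max_reboots automatic reboots are seen inside the
    window, the breaker trips and the watchdog should be left disarmed
    so a human can look at the node.
    """
    return len(recent_reboots(timestamps, now, window)) <= max_reboots
```

With a breaker like this, a node with a bad disk would reboot a handful of
times and then stay down instead of cycling a hundred times.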

If one had to do the watchdogging, should one do the resets locally
using the IPMI local interface (which hogs CPU cycles), or from a central
Nagios-like system that could issue such a command? Many scenarios
seem possible. The prospect of an automated system doing a reboot at
3am seems more tempting than me having to do this manually.
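
For the central variant, a minimal sketch might look like the following. It
assumes the monitoring host records when it last heard from each node and
resets any node that has been silent for longer than the 5 minutes mentioned
above. The heartbeat dictionary, node names, and function names are
hypothetical. The command is only built, not executed, and it assumes
ipmitool is installed with the BMC password supplied through the
IPMI_PASSWORD environment variable (the -E option).

```python
# Sketch of a central "who is hung?" check. All names are hypothetical.
HANG_TIMEOUT = 300.0  # seconds; the 5-minute figure used as an example


def nodes_to_reset(last_seen, now, timeout=HANG_TIMEOUT):
    """Return, sorted by name, the nodes silent for longer than timeout.

    last_seen maps a node name to the time of its last heartbeat.
    """
    return sorted(node for node, t in last_seen.items() if now - t > timeout)


def reset_command(bmc_host, user):
    """Build (but do not run) an ipmitool power-reset command line.

    -E tells ipmitool to read the password from IPMI_PASSWORD.
    """
    return ["ipmitool", "-I", "lanplus", "-H", bmc_host, "-U", user,
            "-E", "chassis", "power", "reset"]
```

The resulting list could be handed to something like subprocess.run, ideally
only after a reboot-count check so a persistently broken node is not reset
over and over.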
