[Beowulf] using watchdog timers to reboot a hung system automagically: Good idea or bad?

Rahul Nabar rpnabar at gmail.com
Fri Oct 23 15:44:27 EDT 2009

On Fri, Oct 23, 2009 at 1:23 PM, Greg Lindahl <lindahl at pbm.com> wrote:
> On Fri, Oct 23, 2009 at 01:01:05PM -0500, Rahul Nabar wrote:
>> 2. Some errors are hardware precipitated. Aging, out-of-warranty
>> hardware can sometimes need such a reboot compromise for
>> one-off random errors.
>> Maybe all the "nice" clusters out there never have this issue but for
>> me it is fairly common. Just confessing.
> Why, exactly, are you assuming that your freezes are one-off random
> errors due to aging hardware? Sounds like you're either guessing, or
> you _are_ doing forensics, but aren't calling it forensics.

Greg. You are right. My bad. In hindsight, that doesn't make much sense. Sorry.

Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
