Reliability analysis was RE: Windows HPC (@ Cornell)
Greg Lindahl
lindahl at keyresearch.com
Thu Nov 7 23:19:08 EST 2002
On Thu, Nov 07, 2002 at 06:34:47PM -0500, Tim Wait wrote:
> One aspect I haven't seen mentioned in this thread, except for
> Greg's oblique reference to Mosix, is that many (most?)
> of our clusters run parallel apps. Regardless of HA, if you have
> a node fail while running a parallel job, you have just blown your
> (supposed) 5 nines away; in my experience, it takes the user O(12+ hours)
> to restart the job. Is this deteriorating to HA vice beowulf?
It's not that hard for queue systems like PBS to detect and restart
jobs that fail due to machines dying -- this is a major
quality-of-implementation issue.
It still hurts your utilization, because you have wasted resources. But
at least the user doesn't have to do anything to get their answer;
they just get it later.
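
To make the trade-off concrete, here is a rough sketch in Python of the
detect-and-requeue idea. It is a toy simulation, not how PBS actually
does it: the Job class, run_with_requeue(), the hour-by-hour clock, and
the "no checkpointing, lose all progress" rule are all assumptions made
up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical parallel job: needs all its nodes for the whole run."""
    name: str
    nodes_needed: int
    hours_needed: int
    hours_done: int = 0
    restarts: int = 0


def run_with_requeue(job, failure_hours, max_restarts=3):
    """Simulate a queue system that requeues a job when a node dies.

    failure_hours is a set of wall-clock hours at which one of the job's
    nodes fails. Assumes no checkpointing, so a failure discards all
    progress. Returns (wall_clock_hours, wasted_node_hours).
    """
    clock = 0
    wasted = 0
    while job.hours_done < job.hours_needed:
        clock += 1
        if clock in failure_hours:
            # Everything computed so far, plus the hour in progress,
            # is thrown away on every node the job was holding.
            wasted += (job.hours_done + 1) * job.nodes_needed
            job.hours_done = 0
            job.restarts += 1
            if job.restarts > max_restarts:
                raise RuntimeError(f"{job.name}: too many restarts")
            continue  # requeued automatically; the user does nothing
        job.hours_done += 1
    return clock, wasted


if __name__ == "__main__":
    # A 4-node, 10-hour job that loses a node during hour 6.
    wall, wasted = run_with_requeue(Job("demo", 4, 10), {6})
    print(f"finished after {wall} h, wasted {wasted} node-hours")
```

In this toy model the job still finishes without user intervention, just
later, and the wasted node-hours are the utilization cost.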
-- greg
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org