What happens with a failed node? (Scyld)

Sean Dilda agrajag at scyld.com
Thu Feb 7 14:19:14 EST 2002

On Thu, 07 Feb 2002, Tony Stocker wrote:

> Sean,
> Okay, suppose there's no mail sent and the slave node keeps rebooting itself
> (for instance if its network connection is down), or the slave node never
> comes back up (it died).  What happens to the process that was running on it?

The node reboots, thus all processes that were running on it die.

> Does the host node reassign it to another slave node after some period of 
> time?  What becomes of this "lost" process?  If there's no information 
> provided to a user that their process was lost when the node went down, and 
> the host node never reassigns it to be completed then it's conceivable that 
> an entire string of processing could be brought to a halt because of this 
> silent failure.  Since the host node maintains the master process list, it 
> should be aware that a process was running on a node that it now lists as 
> down.  What happens to the representation of this process in the table?

I'm not certain, but I believe that when a node goes down, all processes
on it exit as far as the master node is concerned.

As for the lost process getting reassigned, that all depends on what you
are using to spawn jobs.  If it knows enough to realize that a node went
down and a process didn't exit properly, it could theoretically respawn
the job on another node.  Nothing we ship does this.  Most of our split
jobs are done with MPI.  With MPI, the current state of your job is the
processes on /all/ the nodes plus all data that is currently on the wire
(in transit over the network).  This makes it nearly impossible to just
restart one of the processes, rework all the net connections, and
keep all the internal data representations consistent.  This is why
checkpointing is best: it allows you to save data in a consistent state,
then reload it in that same state.  Without knowing the internal
workings of your program, it's essentially impossible for the spawning
program/library to properly do this for you.
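
To make the checkpointing idea concrete, here is a minimal single-process
sketch in Python.  It is only an illustration, not anything Scyld ships:
the file name job.ckpt, the state layout, and the save_checkpoint,
load_checkpoint, and run helpers are all made up for this example, and
fail_at exists only to simulate a node dying.  The program saves its own
state at points it knows to be consistent, and a rerun picks up from the
last save.

```python
import os
import pickle
import tempfile

CHECKPOINT = "job.ckpt"  # hypothetical file name, chosen for this sketch


def save_checkpoint(state, path=CHECKPOINT):
    """Write state to a temp file, then rename it over the old checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        # The rename is atomic on POSIX, so a crash leaves either the old
        # checkpoint or the new one, never a half-written file.
        os.replace(tmp, path)
    except BaseException:
        os.unlink(tmp)
        raise


def load_checkpoint(path=CHECKPOINT):
    """Return the last saved state, or a fresh state if none exists."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return {"step": 0, "total": 0}


def run(steps, every=10, path=CHECKPOINT, fail_at=None):
    """Sum 0..steps-1, saving a checkpoint every `every` steps."""
    state = load_checkpoint(path)
    while state["step"] < steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated node failure")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % every == 0:
            save_checkpoint(state, path)  # state is consistent here
    save_checkpoint(state, path)
    return state["total"]
```

If the first run dies partway through, running it again redoes only the
steps after the last checkpoint.  An MPI job would need more than this:
every process would have to checkpoint at an agreed point with no
messages in flight, which is exactly the part that only the program
itself can arrange.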