[Beowulf] Checkpoint / Restart on 2.6 series kernels for clusters?

Reuti reuti at staff.uni-marburg.de
Wed Nov 3 12:44:39 EST 2004


> I must say though that from what I know checkpointing/restarting
> serial codes is OK.
> Checkpointing parallel jobs is problematic, and from what I've read
> not recommended (the various processes are passing
> messages, and how do you checkpoint in a consistent state?).

I would send a signal from SGE only to the head node of, let's say, an
MPI job. The rank 0 process then has to set a special flag and broadcast
it to the slave processes. The slaves must check for this flag from time
to time, send their state to the head node, and shut down in a proper
way. The head node stores the collected information in a checkpoint
location on a shared file system (we may get different nodes the next
time). I think it's possible to program this when it's included in the
design of the program, but adding it later to an already existing
program is not so easy. - Reuti

