Fault tolerance and MPI

Carl_Notfors at vdgc.com.sg
Mon Feb 5 02:23:42 EST 2001



Our computational model is quite simple.  We have a master node and a
number of slave nodes.  All communication is between the master and the
slaves, i.e. there is no slave-to-slave communication, so everything is
done with MPI_Send and MPI_Recv (we are using LAM/MPI).

The problem with MPI is that there is no fault tolerance: if a slave node
"dies", the whole job goes down.  According to the LAM documentation it
should be possible to achieve some fault tolerance, but we have not yet
tried this.
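
To make the requirement concrete, here is a rough sketch in plain C of the
bookkeeping a master would need in order to survive the loss of a slave:
remember which task each slave holds, and put that task back on the queue
when the slave is declared dead.  Everything in it is hypothetical
(master_state, master_assign, master_slave_failed and the fixed array
sizes are names made up for illustration); it is not taken from LAM or
from our code.  It also assumes the master can detect the failure at all,
for example by selecting the standard MPI_ERRORS_RETURN error handler
instead of the default MPI_ERRORS_ARE_FATAL and checking the return codes
of MPI_Send and MPI_Recv.  Whether an MPI implementation can actually
keep running after a node is lost is implementation-specific and is not
guaranteed by the standard.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical limits, chosen only for illustration. */
#define MAX_SLAVES 8
#define MAX_TASKS  32
#define NO_TASK    (-1)

/* Master-side view of the run (all names are assumptions). */
typedef struct {
    bool alive[MAX_SLAVES];     /* false once a slave is considered dead */
    int  in_flight[MAX_SLAVES]; /* task held by each slave, or NO_TASK   */
    int  pending[MAX_TASKS];    /* stack of task ids still to be sent    */
    int  n_pending;
    int  n_done;
} master_state;

void master_init(master_state *m, int n_slaves, int n_tasks)
{
    assert(n_slaves >= 0 && n_slaves <= MAX_SLAVES);
    assert(n_tasks >= 0 && n_tasks <= MAX_TASKS);
    for (int s = 0; s < MAX_SLAVES; s++) {
        m->alive[s] = (s < n_slaves);
        m->in_flight[s] = NO_TASK;
    }
    /* Push in reverse order so that task 0 is handed out first. */
    m->n_pending = 0;
    for (int t = n_tasks - 1; t >= 0; t--)
        m->pending[m->n_pending++] = t;
    m->n_done = 0;
}

/* Choose the next task for an idle, live slave.  Returns NO_TASK if the
 * slave is dead or busy, or if nothing is pending.  In a real program
 * the master would follow this with an MPI_Send to that slave. */
int master_assign(master_state *m, int slave)
{
    assert(slave >= 0 && slave < MAX_SLAVES);
    if (!m->alive[slave] || m->in_flight[slave] != NO_TASK
        || m->n_pending == 0)
        return NO_TASK;
    int task = m->pending[--m->n_pending];
    m->in_flight[slave] = task;
    return task;
}

/* Record that a result arrived from this slave (after an MPI_Recv). */
void master_complete(master_state *m, int slave)
{
    assert(slave >= 0 && slave < MAX_SLAVES);
    if (m->in_flight[slave] != NO_TASK) {
        m->in_flight[slave] = NO_TASK;
        m->n_done++;
    }
}

/* A send or receive involving this slave reported an error: mark the
 * slave dead and requeue whatever task it was holding. */
void master_slave_failed(master_state *m, int slave)
{
    assert(slave >= 0 && slave < MAX_SLAVES);
    m->alive[slave] = false;
    if (m->in_flight[slave] != NO_TASK) {
        m->pending[m->n_pending++] = m->in_flight[slave];
        m->in_flight[slave] = NO_TASK;
    }
}
```

The master loop would call master_assign before each send,
master_complete after each successful receive, and master_slave_failed
whenever a call involving a slave returns an error, so that no task is
lost along with the node that was working on it.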

Has anyone got this working?  Is there fault tolerance in any other MPI
implementations?  Would it be better to use PVM if you want fault
tolerance?


Carl


_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf


