[Beowulf] tracking down NaN's in an mpi job

Joe Landman landman at scalableinformatics.com
Thu Jan 8 12:08:36 EST 2004

Hi folks:

  I am trying to help a customer track down the spontaneous appearance
of NaNs in a StarCD job.  When the job is submitted in parallel (using
MPICH 1.2.4 as supplied by Adapco) from the head node of their cluster,
the solution becomes unstable in later iterations, and eventually NaNs
start cropping up.  When run on one CPU, it appears to be stable (on
both the head node and the compute nodes).

  They are using the p4 mechanism with rsh.  Has anyone else seen
anything like this?  Is it possible that messages are being missed
somehow in the parallel run?  The head node and compute nodes are
running two different versions of the Linux kernel (same distribution).

  Suggestions/hints welcome.  


Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf