Scyld Nodes Freezing w/ SMP (fwd)

Timothy R. Whitcomb twhitcomb at apl.washington.edu
Thu Nov 13 20:51:47 EST 2003


We are having an ongoing problem with our compute cluster, which runs Scyld
Beowulf 28cz4.  It's a 5-node cluster (each node is dual-processor) with 4
compute nodes and 1 master node.  We are running the Navy's weather model.

The problem:
The model runs fine on 4 processors (1 on each compute node).
However, when I use the SMP capabilities of the machines and try to run on,
say, 8 processors (using both CPUs on each compute node), everything runs
fine for a while.  Then, at an unpredictable point in the run, one of the
nodes invariably freezes.  The cluster loses its connection to the node and
I cannot communicate with it using any of the cluster tools.  Sometimes the
node reboots on its own, but usually I have to go and perform a hard reset.

However, I have found that in most cases, if I run 2 jobs in parallel (i.e.
two 4-CPU jobs, each using only 1 CPU on each node), things seem to work
fine.  Nodes may still freeze from time to time, but not nearly as often.
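
To make the three run configurations described above concrete, here is a
minimal Python sketch that tabulates how many CPUs each compute node keeps
busy in each case.  The node names and the round-robin rank placement are
assumptions made for illustration only; the mapping that BeoMPI actually
chooses on this cluster is not known from the description and may differ.

```python
# Hypothetical node names; the real node numbering is not given in the post.
NODES = ["n0", "n1", "n2", "n3"]


def place(nprocs, nodes=NODES):
    """Assumed round-robin placement: rank r runs on nodes[r % len(nodes)]."""
    layout = {node: [] for node in nodes}
    for rank in range(nprocs):
        layout[nodes[rank % len(nodes)]].append(rank)
    return layout


def busy_cpus(jobs, nodes=NODES):
    """Count the processes per node across all concurrently running jobs."""
    return {node: sum(len(job[node]) for job in jobs) for node in nodes}


single_4 = [place(4)]            # one job, 1 process per node (works)
single_8 = [place(8)]            # one job, 2 processes per node (freezes)
two_jobs = [place(4), place(4)]  # two jobs, 1 process each per node (mostly works)

for name, jobs in [("1 x 4", single_4), ("1 x 8", single_8), ("2 x 4", two_jobs)]:
    print(name, busy_cpus(jobs))
```

Under these assumptions, the single 8-process run and the two concurrent
4-process runs keep the same number of CPUs busy on every node; what differs
is whether the two processes on a node belong to the same MPI job.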

The hardware:
The cluster was obtained pre-built from PSSC Labs.  Each compute node is a
dual-processor Tyan motherboard with 2 Athlon MP CPUs.  The nodes are also
equipped with 2 on-board NICs (lspci reports them as 3Com 3c982 Dual Port
Server Cyclone, rev 78, and the 3c59x kernel driver is used).  We are using
the BeoMPI 1.0.7 implementation of MPICH, compiled with:
--with-device=ch_p4 --with-comm=bproc
(note that I had to recompile BeoMPI with the PGI compiler to get it to
work with the model).
Again, the operating system is Scyld Beowulf 28cz4.
uname -a gives:
Linux nashi 2.4.17-0.18.18_Scyldsmp #1 SMP Thu Jul 11 19:26:54 EDT 2002
i686 unknown

_Please_ help if you have _any_ suggestions whatsoever.  I am at the end
of my rope, and this is presenting a serious impediment to our research!
If you need more information, let me know and I will be happy to provide
it!

Thanks...

Tim Whitcomb
Meteorologist
University of Washington Applied Physics Lab
twhitcomb at apl.washington.edu