Scyld Nodes Freezing w/ SMP (fwd)
jim at ks.uiuc.edu
Fri Nov 14 10:58:10 EST 2003
This is very similar to problems we're seeing on our dual Athlon MP 2600+
cluster with Gigabyte GA-7DPXDW+ motherboards, Intel PRO/1000 MT Server
network cards, and Clustermatic 3 (on Red Hat 8). No solution, though.
On Thu, 13 Nov 2003, Timothy R. Whitcomb wrote:
> We are having an ongoing issue with our compute cluster, running Scyld
> 28cz4. It's a 5-node cluster (each node is dual-processor) with 4 compute
> nodes and 1 master node. We are running the Navy's weather model.
> The problem:
> The model runs fine when run on 4 processors (1 on each compute node).
> However, when I use the SMP capabilities of the machine and try to run on,
> say, 8 processors (using both CPUs on each compute node), everything will
> run fine for a while. Then, after an unpredictable amount of time, a node
> will invariably freeze up. The cluster loses its connection to the node
> and I cannot communicate with it using any of the cluster tools -
> sometimes it will reboot on its own, but usually it requires me to go
> perform a hard reset on the node.
> However, I have found that in most cases if I run 2 jobs in parallel (i.e.
> 2 4-cpu processes, each using only 1 CPU on each node) things seem to work
> fine. Nodes may still freeze from time to time but not nearly as often.
> The hardware:
> The cluster was obtained pre-built from PSSC Labs. Each compute node is a
> dual-processor Tyan MB with 2 Athlon MP CPUs. They
> are also equipped with 2 on-board NICs (lspci reports them as 3Com 3c982
> Dual Port Server Cyclone rev 78, and the 3c59x kernel driver is used). We
> are using the BeoMPI 1.0.7 implementation of MPICH compiled with:
> --with-device=ch_p4 --with-comm=bproc
> (note that I had to recompile BeoMPI with the PGI compiler to get it to
> work with the model)
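The rebuild with the PGI compiler might look like the fragment below. Only the two --with-* flags are taken from the message above; the compiler variable names (CC/FC) and compiler binaries (pgcc/pgf90) are assumptions - check ./configure --help for the exact spelling your BeoMPI version expects.

```shell
# Hedged sketch: reconfigure BeoMPI 1.0.7 with the PGI compilers.
CC=pgcc FC=pgf90 ./configure --with-device=ch_p4 --with-comm=bproc
make
make install
```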
> Again, we use Scyld Beowulf 28cz4 for the operating system
> uname -a gives
> Linux nashi 2.4.17-0.18.18_Scyldsmp #1 SMP Thu Jul 11 19:26:54 EDT 2002
> i686 unknown
> _Please_ help if you have _any_ suggestions whatsoever. I am at the end
> of my rope, and this is presenting a serious impediment to our research!
> If you need more information, let me know and I will be happy to provide it.
> Tim Whitcomb
> University of Washington Applied Physics Lab
> twhitcomb at apl.washington.edu
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf