Scyld Nodes Freezing w/ SMP (fwd)
larry at pssclabs.com
Sat Nov 15 14:01:53 EST 2003
As this cluster has been running for over a year without any crashes, I
would suspect that the hardware is fine. In general, the Tyan 2466 supports
SMP applications fairly well. We have installed many Beowulfs using the
Tyan 2466 without any SMP issues. However, most customers use Red Hat.
Have you tried running the model with both processors on the head node
only? If that fails, you may want to install a current version of Red Hat and
see if that works better.
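One way to separate an SMP problem from an MPI or network problem is to load both CPUs on a single node with no communication at all. The following is only a minimal sketch, assuming a Python 3 interpreter is available on the node being tested; the names burn, stress, WORKERS and SECONDS are hypothetical and not part of any Scyld or BeoMPI tool.

```python
import multiprocessing as mp
import time

WORKERS = 2      # assumption: one worker per CPU on a dual-processor node
SECONDS = 2.0    # assumption: short default; raise this for a real soak test


def burn(seconds):
    """Spin on floating-point work for `seconds`; return the iteration count."""
    deadline = time.time() + seconds
    x, n = 1.0001, 0
    while time.time() < deadline:
        x = (x * x) % 1.7 + 1.0001  # arbitrary arithmetic to keep the CPU busy
        n += 1
    return n


def stress(workers, seconds):
    """Run `workers` copies of burn() in parallel, one process each."""
    with mp.Pool(processes=workers) as pool:
        return pool.map(burn, [seconds] * workers)


if __name__ == "__main__":
    counts = stress(WORKERS, SECONDS)
    print("iterations per worker:", counts)
```

If a node survives a long run of this but still freezes under the 8-process MPI job, the NIC driver or the communication path becomes a more likely suspect than SMP itself. That is a heuristic, not proof either way.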
At 05:51 PM 11/13/2003 -0800, you wrote:
>We are having an ongoing issue with our compute cluster, running Scyld
>28cz4. It's a 5-node cluster (each node is dual-processor) with 4 compute
>nodes and 1 master node. We are running the Navy's weather model.
>The model runs fine when run on 4 processors (1 on each compute node).
>However, when I use the SMP capabilities of the machine and try to run on,
>say, 8 processors (using both CPUs on each compute node), everything will
>run fine for a while. Then, at an unpredictable time, a node will
>invariably freeze up. The cluster loses its connection to the
>node and I cannot communicate with it using any of the cluster tools -
>sometimes it will automatically reboot, but usually it requires me to go
>perform a hard reset on the node.
>However, I have found that in most cases if I run 2 jobs in parallel (i.e.
>two 4-CPU jobs, each using only 1 CPU on each node) things seem to work
>fine. Nodes may still freeze from time to time but not nearly as often.
>The cluster was obtained pre-built from PSSC Labs. Each compute node is a
>dual-processor Tyan MB with 2 Athlon MP CPUs. They
>are also equipped with 2 on-board NICs (lspci gives them as 3Com 3c982
>Dual Port Server Cyclone rev 78 and the 3c59x kernel driver is used). We
>are using the BeoMPI 1.0.7 implementation of MPICH compiled with:
>(note that I had to recompile BeoMPI with the PGI compiler to get it to
>work with the model)
>Again, we use Scyld Beowulf 28cz4 for the operating system.
>uname -a gives:
>Linux nashi 2.4.17-0.18.18_Scyldsmp #1 SMP Thu Jul 11 19:26:54 EDT 2002
>_Please_ help if you have _any_ suggestions whatsoever. I am at the end
>of my rope, and this is presenting a serious impediment to our research!
>If you need more information, let me know and I will be happy to provide it.
>University of Washington Applied Physics Lab
>twhitcomb at apl.washington.edu
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf