Scyld Nodes Freezing w/ SMP (fwd)
landman at scalableinformatics.com
Fri Nov 14 12:29:52 EST 2003
I commented to Timothy offline that I am seeing stability problems on
my customers' machines based on Tyan 2466 motherboards.
We had some success via motherboard replacement (after isolating
subsystems through memory tests/exchanges, I/O loads, network loads,
...). Other cases required CPU replacement; the CPUs appeared to be
burned out. The failures were very difficult to isolate: many symptoms,
few of them repeatable.
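The subsystem-by-subsystem isolation pass described above can be
scripted. This is only a rough sketch: the tool choices (memtester, dd,
ping), the sizes, and the hostname "master" are illustrative
assumptions, not what was actually run on these clusters.

```shell
#!/bin/sh
# Burn in one suspect node, one subsystem at a time, logging results.
# All sizes, counts, and the "master" hostname are placeholders.
LOG=burnin.log
: > "$LOG"

run() {  # run a labeled test and record PASS/FAIL in the log
    label=$1; shift
    if "$@" >/dev/null 2>&1; then
        echo "$label: PASS" >> "$LOG"
    else
        echo "$label: FAIL" >> "$LOG"
    fi
}

run memory  memtester 64M 1                                  # RAM
run disk    dd if=/dev/zero of=/tmp/io.$$ bs=1M count=64     # I/O load
rm -f /tmp/io.$$
run network ping -c 20 master                                # net load

cat "$LOG"
```

A node that fails one subsystem repeatably while the others pass is a
candidate for a targeted swap (DIMMs, disk, NIC) before replacing the
whole board.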
Jim Phillips wrote:
> This is very similar to problems we're seeing on our dual Athlon MP 2600+
> cluster with Gigabyte GA-7DPXDW+ motherboards, Intel PRO/1000 MT Server
> network cards, and Clustermatic 3 (on Red Hat 8). No solution, though.
> On Thu, 13 Nov 2003, Timothy R. Whitcomb wrote:
>>We are having an ongoing issue with our compute cluster, running Scyld
>>28cz4. It's a 5-node cluster (each node is dual-processor) with 4 compute
>>nodes and 1 master node. We are running the Navy's weather model.
>>The model runs fine when run on 4 processors (1 on each compute node).
>>However, when I use the SMP capabilities of the machine and try to run on,
>>say, 8 processors (using both CPUs on each compute node), everything will
>>run fine for a while. Then, at an unpredictable time, a node will
>>invariably freeze up. The cluster loses its connection to the node and
>>I cannot communicate with it using any of the cluster tools -
>>sometimes it will reboot on its own, but usually it requires me to go
>>perform a hard reset on the node.
>>However, I have found that in most cases, if I run 2 jobs in parallel
>>(i.e., two 4-CPU processes, each using only 1 CPU on each node), things
>>seem to work fine. Nodes may still freeze from time to time, but not
>>nearly as often.
>>The cluster was obtained pre-built from PSSC Labs. Each compute node
>>is a dual-processor Tyan motherboard with 2 Athlon MP CPUs. They are
>>also equipped with 2 on-board NICs (lspci reports them as a 3Com 3c982
>>Dual Port Server Cyclone rev 78, and the 3c59x kernel driver is used).
>>We are using the BeoMPI 1.0.7 implementation of MPICH compiled with:
>>(note that I had to recompile BeoMPI with the PGI compiler to get it
>>to work with the model)
>>Again, we use Scyld Beowulf 28cz4 for the operating system
>>uname -a gives
>>Linux nashi 2.4.17-0.18.18_Scyldsmp #1 SMP Thu Jul 11 19:26:54 EDT 2002
>>_Please_ help if you have _any_ suggestions whatsoever. I am at the
>>end of my rope, and this is presenting a serious impediment to our
>>research! If you need more information, let me know and I will be
>>happy to provide it.
>>University of Washington Applied Physics Lab
>>twhitcomb at apl.washington.edu
>>Beowulf mailing list, Beowulf at beowulf.org
>>To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
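The workaround Timothy describes (two 4-rank jobs, one CPU per node,
instead of one 8-rank job) can be expressed in an MPICH machines file.
This is a hedged sketch: the node names are made up, and Scyld's
bproc-based launch conventions may differ from a stock MPICH p4 setup.

```shell
# Hypothetical machines files for 4 dual-CPU compute nodes.

# Both CPUs per node (the configuration that triggers the freezes):
# the host:count syntax asks MPICH's p4 device for 2 ranks per host.
cat > machines.smp <<'EOF'
node0:2
node1:2
node2:2
node3:2
EOF

# One CPU per node (the workaround): list each host once, then launch
# two independent 4-rank jobs instead of one 8-rank job.
cat > machines.up <<'EOF'
node0
node1
node2
node3
EOF

# mpirun -np 8 -machinefile machines.smp ./model   # triggers freezes
# mpirun -np 4 -machinefile machines.up  ./model   # job 1 (stable)
# mpirun -np 4 -machinefile machines.up  ./model   # job 2 (stable)
```

That the freeze tracks CPU occupancy per node rather than total rank
count is consistent with the hardware-level (thermal/power/chipset)
suspicions raised elsewhere in the thread.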
Joseph Landman, Ph.D.
Scalable Informatics LLC
email: landman at scalableinformatics.com
web : http://scalableinformatics.com
phone: +1 734 612 4615