Scyld Nodes Freezing w/ SMP (fwd)

Joseph Landman landman at scalableinformatics.com
Fri Nov 14 12:29:52 EST 2003


Hi Jim:

   I commented to Timothy offline that I am seeing stability problems on 
my customers' machines based on Tyan 2466 motherboards.

   We had some success with motherboard replacement (after isolating 
subsystems through memory tests and swaps, IO loads, network loads, and 
so on).  In other cases the fix was CPU replacement; those CPUs appeared 
to be burned out.  The failures were very difficult to isolate: lots of 
symptoms, few of them repeatable.
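
   Something like the small MPI burn-in loop below (a rough sketch, not 
from the original thread; the buffer size and iteration count are 
arbitrary) can help separate an application problem from a node/driver 
problem: run it once with one rank per node and once with two ranks per 
node, and see whether the freeze follows the second case.

/*
 * Sketch of an MPI burn-in loop: heavy collective traffic plus memory
 * pressure on every rank.  Buffer size and iteration count are
 * arbitrary choices for illustration.
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define BUF_DOUBLES (1 << 20)   /* ~8 MB of doubles per buffer */
#define ITERATIONS  10000

int main(int argc, char **argv)
{
    int rank, size, namelen, i;
    char host[MPI_MAX_PROCESSOR_NAME];
    double *in, *out, t0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &namelen);

    in  = malloc(BUF_DOUBLES * sizeof(double));
    out = malloc(BUF_DOUBLES * sizeof(double));
    for (i = 0; i < BUF_DOUBLES; i++)
        in[i] = (double) (rank + i);

    t0 = MPI_Wtime();
    for (i = 0; i < ITERATIONS; i++) {
        /* large collective: stresses NICs, memory, and both CPUs if
         * two ranks share a node */
        MPI_Allreduce(in, out, BUF_DOUBLES, MPI_DOUBLE, MPI_SUM,
                      MPI_COMM_WORLD);
        if (i % 100 == 0 && rank == 0)
            printf("iteration %d, %.1f s elapsed\n", i, MPI_Wtime() - t0);
        /* keep output flushed so the last progress line survives a freeze */
        fflush(stdout);
    }

    printf("rank %d on %s finished\n", rank, host);
    free(in);
    free(out);
    MPI_Finalize();
    return 0;
}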

Joe

Jim Phillips wrote:

> Hi,
> 
> This is very similar to problems we're seeing on our dual Athlon MP 2600+
> cluster with Gigabyte GA-7DPXDW+ motherboards, Intel PRO/1000 MT Server
> network cards, and Clustermatic 3 (on Red Hat 8).  No solution, though.
> 
> -Jim
> 
> 
> On Thu, 13 Nov 2003, Timothy R. Whitcomb wrote:
> 
> 
>>We are having an ongoing issue with our compute cluster, running Scyld
>>28cz4.  It's a 5-node cluster (each node is dual-processor) with 4 compute
>>nodes and 1 master node.  We are running the Navy's weather model.
>>
>>The problem:
>>The model runs fine when run on 4 processors (1 on each compute node).
>>However, when I use the SMP capabilities of the machine and try to run on,
>>say, 8 processors (using both CPUs on each compute node), everything will
>>run fine for a while.  Then, at an unpredictable point, a node will
>>invariably freeze up.  The cluster loses its connection to the node and I
>>cannot communicate with it using any of the cluster tools - sometimes the
>>node will automatically reboot, but usually it requires me to go perform a
>>hard reset on it.
>>
>>However, I have found that in most cases if I run 2 jobs in parallel (i.e.
>>2 4-cpu processes, each using only 1 CPU on each node) things seem to work
>>fine.  Nodes may still freeze from time to time but not nearly as often.
>>
>>The hardware:
>>The cluster was obtained pre-built from PSSC Labs.  Each compute node is a
>>dual-processor Tyan motherboard with 2 Athlon MP CPUs.  The nodes are also
>>equipped with 2 on-board NICs (lspci reports them as 3Com 3c982 Dual Port
>>Server Cyclone rev 78, and the 3c59x kernel driver is used).  We
>>are using the BeoMPI 1.0.7 implementation of MPICH compiled with:
>>--with-device=ch_p4 --with-comm=bproc
>>(note that I had to recompile BeoMPI with the PGI compiler to get it to
>>work with the model)
>>Again, we use Scyld Beowulf 28cz4 for the operating system.  uname -a gives:
>>Linux nashi 2.4.17-0.18.18_Scyldsmp #1 SMP Thu Jul 11 19:26:54 EDT 2002
>>i686 unknown
>>
>>_Please_ help if you have _any_ suggestions whatsoever.  I am at the end
>>of my rope, and this is presenting a serious impediment to our research!
>>If you need more information, let me know and I will be happy to provide
>>it!
>>
>>Thanks...
>>
>>Tim Whitcomb
>>Meteorologist
>>University of Washington Applied Physics Lab
>>twhitcomb at apl.washington.edu
> 
> 

-- 

Joseph Landman, Ph.D
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://scalableinformatics.com
phone: +1 734 612 4615


_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
