Scyld Nodes Freezing w/ SMP (fwd)

Thu Nov 13 21:38:18 EST 2003

hi ya tim

i've usually been able to fix random cpu crashes by changing
the kernel to the latest/greatest one at the time ( 2.4.22 is what
i'd pick if i were to use a new smp kernel today )

if the latest kernel has no effect, then there are some other
serious hw problems ... timing issues ??
	- make sure the kernel is compiled for athlon and not p4
	and smp enabled

	- memory clock speeds, marginal memory sticks
	( get rid of generic no-name-brand memory sticks )
		- swap memory sticks and see if the problem
		follows the memory ( keep good track of each stick
		so you can easily identify it even if all the memory
		was thrown on the floor all at the same time )

	- make sure you only have 1 ide disk on each cable to
	help identify any other hw issues 

	- blow air, with a household 24"-36" fan, in the same direction
	as normal airflow of the system and see if it helps any

	- replace the home-made nic cables with molded cat-5 cables
	where it's obvious that a person didn't hand-crimp the wires
		- swap the ports that the nic cables are connected to

		- inexpensive hubs are the next thing to swap out
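
for the first item ( kernel compiled for athlon with smp enabled ), here
is a rough sketch of how one could check a saved kernel config.  the
option names CONFIG_SMP, CONFIG_MK7 ( athlon/duron/k7 ) and
CONFIG_MPENTIUM4 ( p4 ) are what i'd expect in a 2.4-series x86 .config,
but treat them as assumptions and compare against your own tree.  the
function names are made up for illustration, and where scyld keeps the
config for its stock kernel is something you'd have to look up.

```python
import re


def parse_kernel_config(text):
    """Return a dict of CONFIG_* options that are set in a .config file.

    Lines such as '# CONFIG_FOO is not set' are comments and are skipped.
    """
    options = {}
    for line in text.splitlines():
        match = re.match(r"^(CONFIG_\w+)=(.*)$", line.strip())
        if match:
            options[match.group(1)] = match.group(2)
    return options


def check_athlon_smp(text):
    """Return a list of problems found; an empty list means it looks ok.

    Assumed option names (2.4-series x86 kernels):
      CONFIG_SMP        multiprocessor support
      CONFIG_MK7        athlon/duron/k7 processor family
      CONFIG_MPENTIUM4  pentium 4 processor family
    """
    opts = parse_kernel_config(text)
    problems = []
    if opts.get("CONFIG_SMP") != "y":
        problems.append("CONFIG_SMP is not enabled")
    if opts.get("CONFIG_MK7") != "y":
        problems.append("CONFIG_MK7 (athlon) is not selected")
    if opts.get("CONFIG_MPENTIUM4") == "y":
        problems.append("CONFIG_MPENTIUM4 (p4) is selected")
    return problems
```

to use it, read the .config of the kernel the nodes actually boot and
pass its contents to check_athlon_smp(); anything it returns is worth
fixing before chasing the hw items in the list.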

c ya
alvin


On Thu, 13 Nov 2003, Timothy R. Whitcomb wrote:

> We are having an ongoing issue with our compute cluster, running Scyld
> 28cz4.  It's a 5-node cluster (each node is dual-processor) with 4 compute
> nodes and 1 master node.  We are running the Navy's weather model.
> 
> The problem:
> The model runs fine when run on 4 processors (1 on each compute node).
> However, when I use the SMP capabilities of the machine and try to run on,
> say, 8 processors (using both CPUs on each compute node), everything will
> run fine for a while.  Then, at a non-consistent time, a node will
> invariably freeze up.  The cluster loses its connection to the
> node and I cannot communicate with it using any of the cluster tools -
> sometimes it will automatically reboot, but usually it requires me to go
> perform a hard reset on the node.
> 
> However, I have found that in most cases if I run 2 jobs in parallel (i.e.
> 2 4-cpu processes, each using only 1 CPU on each node) things seem to work
> fine.  Nodes may still freeze from time to time but not nearly as often.
> 
> The hardware:
> The cluster was obtained pre-built from PSSC Labs.  Each compute node is a
> dual-processor Tyan MB with 2 Athlon MP CPUs.  They
> are also equipped with 2 on-board NICs (lspci gives them as 3com 3c982
> Dual Port Server Cyclon rev 78 and the 3c59x kernel driver is used).  We
> are using the BeoMPI 1.0.7 implementation of MPICH compiled with:
> --with-device=ch_p4 --with-comm=bproc
> (note that I had to recompile BeoMPI with the PGI compiler to get it to
> work with the model)
> Again, we use Scyld Beowulf 28cz4 for the operating system
> uname -a gives
> Linux nashi 2.4.17-0.18.18_Scyldsmp #1 SMP Thu Jul 11 19:26:54 EDT 2002
> i686 unknown
> 
> _Please_ help if you have _any_ suggestions whatsoever.  I am at the end
> of my rope, and this is presenting a serious impediment to our research!
> If you need more information, let me know and I will be happy to provide
> it!
> 
> Thanks...
> 
> Tim Whitcomb
> Meteorologist
> University of Washington Applied Physics Lab
> twhitcomb at apl.washington.edu
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
> 
