Dual-Athlon Cluster Problems

Chris Steward chris at wehi.edu.au
Thu Jan 23 01:45:40 EST 2003


Hi,

We're in the process of setting up a new 32-node dual-Athlon cluster running
Red Hat 7.3, kernel 2.4.18-19.7.xsmp. The configuration is attached below. We're
having problems with nodes hanging during calculations, sometimes only after
several hours of runtime. We have a serial console connected to these nodes,
but it is unable to interact with a node once it hangs, and nothing is logged
either. Running jobs on one CPU does not present much of a problem, but when
the machines are fully loaded (both CPUs at 100% utilization) errors start to
occur and machines die, often up to 8 nodes within 24 hours. The temperature
of the nodes under full load is approximately 55C. We have tried the "noapic"
kernel option, but the problems persist. Running other software that does not
require EnFuzion 6 produces the same problems.
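
Since nothing survives in the logs, one idea is a heartbeat logger on each
node that forces its entries to disk, so that after a crash the tail of the
log shows the last time the node was alive and its last sensor readings. A
minimal sketch (it assumes the lm_sensors "sensors" command is available;
the log path and interval are placeholders):

#!/usr/bin/env python
# Heartbeat/sensor logger (sketch). Assumes lm_sensors provides a
# "sensors" command; LOG and INTERVAL are placeholders.
import os
import time

LOG = "/var/log/node-heartbeat.log"   # hypothetical location
INTERVAL = 10                         # seconds between entries

def read_sensors():
    # Output format varies by sensor chip, so capture it verbatim.
    pipe = os.popen("sensors 2>/dev/null")
    data = pipe.read()
    pipe.close()
    return data.strip()

while 1:
    f = open(LOG, "a")
    f.write("%s\n%s\n\n" % (time.ctime(), read_sensors()))
    f.flush()
    os.fsync(f.fileno())   # force to disk so the last entry survives a hang
    f.close()
    time.sleep(INTERVAL)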

We seek feedback on the following:

1/ Are there issues using Red Hat 7.3, as opposed to 7.2, in such
   a setup?

2/ Are there known issues with 2.4.18 kernels and AMD chips?
   We suspect the problems are kernel-related.

3/ Are there any problems with dual-Athlon clusters using the
   MSI K7D Master-L motherboard?

4/ Are there any other outstanding issues with these machines
   under constant heavy load? (A minimal load-generation sketch
   follows this list.)
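
To reproduce the fully-loaded failure mode independently of the real job
mix, something as simple as the following is enough to pin both CPUs at
100% (a sketch; NCPUS and the busy-loop are arbitrary stand-ins for the
actual workload):

#!/usr/bin/env python
# CPU burner (sketch): fork one busy child per CPU. NCPUS matches the
# dual-CPU nodes described above; the arithmetic is arbitrary work.
import os

NCPUS = 2

def burn():
    x = 0.0001
    while 1:
        x = x * 1.0000001 + 0.000001   # pointless floating-point work

pids = []
for i in range(NCPUS):
    pid = os.fork()
    if pid == 0:
        burn()           # child spins forever; never returns
    pids.append(pid)

for pid in pids:
    os.waitpid(pid, 0)   # parent blocks until the burners are killed

Run alongside the heartbeat logger above, this should show whether the
hangs follow raw CPU load, independent of EnFuzion or DOCK.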

Any advice/help would be greatly appreciated.

Thanks in advance

Chris

--------------------------------------------------------------
Cluster configuration

Node configuration:

CPUs:                    2 x Athlon MP 2000+
RAM:                     1024 MB Kingston PC2100 DDR
Operating system:        Red Hat 7.3 (with updates)
Kernel:                  2.4.18-19.7.xsmp
Motherboard:             MSI K7D Master-L (Award BIOS 1.5)
Network:                 On-board PCI Ethernet (Intel Corp. 82559ER, rev 09),
                         latest Intel drivers, "no sleep" option set

Head-node:

CPU:                     single Athlon MP 2000+

Dataserver:

CPU:                     single Athlon MP 2000+
Network:                 PCI Gigabit NIC

Network Interconnect:

Cisco 2950 (one GBIC installed)

Software:

Cluster management:      EnFuzion 6
Computational:           DOCK v4.0.1



