[Beowulf] SMP Nodes [still] Freezing w/ Scyld (follow-up, long)

Timothy R. Whitcomb twhitcomb at apl.washington.edu
Wed Jan 7 15:32:12 EST 2004


First of all, thanks to everyone who responded to my first message.  I
received many helpful suggestions to test, which I have documented
here.

As a quick review of my (still unsolved) situation, we are running a
weather model on a 5-node, 10-processor Scyld Beowulf cluster.  As
long as we use a single processor per machine, runs are fine, but
when we use both processors on a single node, that node freezes
after a variable length of time (sometimes minutes, sometimes hours).

Here are the suggestions I received, with the result of testing each:

-Make sure kernel is compiled for athlon architecture
> Kernel is indeed compiled for Athlon and SMP

-Try 1 IDE disk/cable
> There is only 1 disk/cable in the slave nodes

-Manually ventilate cases
> no change in results when cases were opened up and fanned

-Replace NIC cables
> replaced all NIC cables with no change in results

-Tyan 2466 motherboards have stability issues
> This is not a 2466 motherboard

-Check with the Navy and see if this is run with SMP
> The Naval Research Lab runs the model using SMP machines and a Linux
cluster (though not necessarily Scyld)

-Check thermal problems
> no change

-Flash update the BIOS
> no change, BIOS updated to most recent version

-Run CPUBURN for several days
> ran over the weekend on all nodes, system seemed stable

-Try new network drivers
> Updated network drivers to Becker's most recent 3c59x for the dual
3Com 3c982 Server Tornado built-in Ethernet, but with no apparent
change

The suggestions that I have not been able to try as yet:
-Try upgrading the kernel and bproc package

Since this is a Scyld system purchased from a retailer, the kernel has
all sorts of patches in it.  Can anyone point me to anywhere detailing
how to do a kernel upgrade on one of these systems?  I've built
kernels before but usually only using the kernel.org sources.

I'd also like to know if it is possible for the cluster interconnects
to be "too slow" for the software - there is a 100BaseT ethernet
between the nodes but there is a _lot_ of data being passed around.

Something else I have noticed is that if I run a 2-processor run on 2
nodes, it goes fairly quickly.  However, if I switch it to 2
processors on 1 machine, it stalls for very long periods of time.
From the memory usage, it looks like quite a bit is in swap
(there's 512MB of RAM per node and 1GB of swap), but there is little
to no CPU activity on either processor for very long periods of time (on
the order of several minutes and up), whereas in the 2-node case there
are brief pauses, but CPU usage jumps up to 100% fairly quickly.  In
the 1-node case, one processor will go up a little but then fall back
down quickly.
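
One way to check the swap observation more precisely is to read the
SwapTotal and SwapFree fields from /proc/meminfo on the node while the
job runs.  The Python sketch below is a minimal illustration; the
function names are my own, and it assumes the node exposes the usual
"Name: value kB" lines and that Python is available wherever the
script runs.

```python
def parse_meminfo(text):
    """Return {field: kB} for 'Name: value kB' lines of /proc/meminfo text.

    Lines in any other layout (for example the older multi-column
    summary lines) are skipped.
    """
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and len(fields) == 2 and fields[0].isdigit() and fields[1] == "kB":
            info[key.strip()] = int(fields[0])
    return info


def swap_used_fraction(info):
    """Fraction of swap in use, or 0.0 if no swap is reported."""
    total = info.get("SwapTotal", 0)
    if total == 0:
        return 0.0
    return (total - info.get("SwapFree", 0)) / float(total)


if __name__ == "__main__":
    with open("/proc/meminfo") as f:
        info = parse_meminfo(f.read())
    print("Swap in use: %.0f%%" % (100 * swap_used_fraction(info)))
```

A swap fraction that keeps climbing while both CPUs sit idle would
suggest that two copies of the model do not fit in 512MB and the node
is spending its time paging rather than computing.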

Are there any other ideas as to what would be causing these nodes to
freeze up?  Thanks again for all the help I've received.

Tim Whitcomb

