(Scyld) Nodes going down unexpectedly

Timothy R. Whitcomb twhitcomb at apl.washington.edu
Thu Aug 7 15:55:15 EDT 2003

We have a 10-processor cluster and are currently running a weather model
on 4 of the processors.  When I try to increase the number, the run works
for a while, then the "beostatus" window shows one node's information
frozen for a little while before reporting the node status as "down".
Each node is dual-processor, and I have noticed (but not verified) that
this happens when both processors on a node are in use.

After the node status changes to "down", I cannot restart it through the
console tools on the root node.  However, I know that the node is still
alive and on the network because I can ping it successfully.  I then have
to restart the node by hand, which is a bit of an issue since we're on
opposite sides of the building.

What's going on here, and what can I do to mitigate or fix it?

Tim Whitcomb
twhitcomb at apl.washington.edu
Applied Physics Lab
University of Washington
