(Scyld) Nodes going down unexpectedly
Timothy R. Whitcomb
twhitcomb at apl.washington.edu
Thu Aug 7 15:55:15 EDT 2003
We have a 10-processor cluster and are currently running a weather model
on 4 of the processors. When I try to increase the number of processors,
the run works for a while, then the "beostatus" window shows one node's
information frozen for a short time before it reports the node status as "down".
Each node is dual-processor, and it appears (though I have not verified
this) that the problem occurs when both processors on a node are in use.
After the node status changes to "down", I cannot restart it through the
console tools on the root node. However, I know that the node is still
alive and on the network because I can ping it successfully. I have to
restart the node by hand, which is a bit of a problem since the cluster
and I are on opposite sides of the building.
What's going on here and what can I do to mitigate/fix this?
Tim Whitcomb
twhitcomb at apl.washington.edu
Applied Physics Lab
University of Washington