[Beowulf] IB symbol error thresholds for health check scripts ?
stuartb at 4gh.net
Wed Dec 29 13:29:21 EST 2010
On Mon, 13 Dec 2010 at 17:43 -0000, Christopher Samuel wrote:
> We run a bunch of health checks on a compute node through Torque
> and if they fail the node gets knocked offline.
Can you share these scripts? I need to get something started
along these lines (Torque, Moab, Infiniband, IBM System x, xCAT).
I'm sure I'll find things that need adapting to our environment.
> One of the checks we do is to check that there are no symbol errors
> on the IB link. However, I'm wondering if simply saying a single
> error is too brutal for this - what do other people do about these ?
I'm currently looking at Infiniband problems and have been watching
our SymbolErrorCounter values. I'm told a "small number" of these
errors is okay, but I don't know how "small" is defined or over what
time period.
Over the last week 24 of our nodes have shown at least two errors.
Of these, 6 nodes are showing over 400 errors (450-30000) and
need attention (I've manually downed them until I can get to the
hardware). The remaining 18 nodes all show fewer than 50 errors, with
half of those below 10.
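
One way to make "small number" concrete for a health check is to judge the growth of the counter over a time window rather than its absolute value. The following is a minimal sketch only: the function name and both rate thresholds are assumptions picked to separate the two groups of nodes described above, not vendor guidance.

```python
def classify_symbol_errors(previous, current, hours,
                           warn_per_hour=0.1, fail_per_hour=2.0):
    """Classify a link by SymbolErrorCounter growth between two readings.

    previous/current are counter readings taken `hours` apart.
    The thresholds are illustrative assumptions, not recommended values.
    """
    if hours <= 0:
        raise ValueError("hours must be positive")
    delta = current - previous
    if delta < 0:
        # The counter was cleared between samples; count from zero.
        delta = current
    rate = delta / hours
    if rate >= fail_per_hour:
        return "fail"
    if rate >= warn_per_hour:
        return "warn"
    return "ok"
```

With the week of counts above (168 hours), 450 errors is about 2.7 per hour and lands in "fail", while 10 errors is about 0.06 per hour and stays "ok".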
I'm planning to do more proactive monitoring of the Infiniband Fabric.
The current toolset is very awkward to use for monitoring. There is
an updated Infiniband Fabric Suite from QLogic which appears to
improve this significantly. It should be possible to do the
Infiniband monitoring completely off node so as not to perturb the
computations too much.
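
As a rough illustration of off-node collection, the sketch below parses port counter text in the `Name:....value` layout that `perfquery` from infiniband-diags prints. The sample text is made up for illustration, the function name is hypothetical, and the exact output layout of your tool version is an assumption worth checking first.

```python
import re

# Matches lines such as "SymbolErrorCounter:..............12".
_COUNTER_LINE = re.compile(r"^(\w+):\.*(\d+)\s*$")

# Made-up sample in the assumed layout; not real fabric data.
SAMPLE = """\
# Port counters: Lid 1 port 1
PortSelect:......................1
CounterSelect:...................0x1b01
SymbolErrorCounter:..............12
LinkErrorRecoveryCounter:........0
"""

def parse_counters(text):
    """Return {counter_name: int} for every decimal counter line.

    Comment lines and non-decimal values (for example hex) are skipped.
    """
    counters = {}
    for line in text.splitlines():
        match = _COUNTER_LINE.match(line.strip())
        if match:
            counters[match.group(1)] = int(match.group(2))
    return counters

if __name__ == "__main__":
    print(parse_counters(SAMPLE)["SymbolErrorCounter"])
```

The text would be gathered from a management node, one port at a time, so nothing needs to run on the compute nodes themselves.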
>  - for the record we check things like - amount of RAM, failed
> DIMMs (via IPMI on IBM or memlog on SGI), number of cores, number
> and speed of CPUs, LDAP OK, home directories accessible, etc.
All things we need to check. I manually found several of our nodes
running with one disabled RAM stick.
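
A disabled RAM stick can be caught by comparing the kernel's MemTotal with what the node should have. This is a minimal sketch assuming the standard /proc/meminfo layout on Linux; the function names, the expected size, and the 5% tolerance (MemTotal is always somewhat below installed RAM) are assumptions to adjust per node type.

```python
def mem_total_kb(meminfo_text):
    """Extract MemTotal (in kB) from /proc/meminfo-style text."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1])
    raise ValueError("MemTotal not found")

def ram_ok(meminfo_text, expected_kb, tolerance=0.05):
    """True if MemTotal is within `tolerance` of the expected size.

    expected_kb is the installed RAM for this node type (site-specific).
    """
    return mem_total_kb(meminfo_text) >= expected_kb * (1 - tolerance)

if __name__ == "__main__":
    # Hypothetical node that should have 24 GB installed.
    with open("/proc/meminfo") as handle:
        print(ram_ok(handle.read(), expected_kb=24 * 1024 * 1024))
```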
>  - checks run prior to a job start, after a job exits and every
> 7.5 minutes (every 10 mom intervals).
And also when the node comes up, before mom starts, I assume?
I've never been lost; I was once bewildered for three days, but never lost!
-- Daniel Boone