[Beowulf] IB symbol error thresholds for health check scripts?

Christopher Samuel samuel at unimelb.edu.au
Mon Dec 13 17:43:19 EST 2010


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi folks,

We run a bunch of health checks [1] on each compute node through
Torque [2], and if any of them fail the node gets knocked offline.

One of the checks we do is to check that there are no symbol
errors on the IB link. However, I'm wondering whether failing a
node on a single symbol error is too brutal. What thresholds do
other people use for these?
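
To make the question concrete, here is a minimal sketch of a rate-based
check as an alternative to failing on any non-zero count. It is only an
illustration under several assumptions: the sysfs path (device mlx4_0,
port 1) is a placeholder, the 10 errors/hour limit is an arbitrary value
and not a recommendation, the function names are hypothetical, and it
assumes the health check wrapper treats output starting with ERROR as a
failure. Storing the previous sample between runs is left out.

```python
from pathlib import Path

# Assumed counter location; the device name and port number are placeholders.
COUNTER = Path("/sys/class/infiniband/mlx4_0/ports/1/counters/symbol_error")


def read_counter(path=COUNTER):
    """Return the current symbol error count as an integer."""
    return int(path.read_text().strip())


def symbol_error_rate(prev_count, prev_time, count, now):
    """Return symbol errors per hour between two samples.

    Times are in seconds. If the counter went backwards it is assumed to
    have been reset, so the new value is taken as the number of errors.
    """
    elapsed = now - prev_time
    if elapsed <= 0:
        return 0.0
    delta = count - prev_count if count >= prev_count else count
    return delta * 3600.0 / elapsed


def check(prev_count, prev_time, count, now, max_per_hour=10.0):
    """Return an ERROR line if the rate exceeds the limit, else None.

    max_per_hour is an illustrative placeholder, not a recommended value.
    """
    rate = symbol_error_rate(prev_count, prev_time, count, now)
    if rate > max_per_hour:
        return "ERROR IB symbol errors at %.1f/hour (limit %.1f)" % (rate, max_per_hour)
    return None
```

With samples taken 450 seconds apart (the 7.5 minute interval), one new
error works out to 8 errors/hour and passes under this placeholder limit,
while five new errors work out to 40 errors/hour and fail.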

cheers!
Chris

[1] - for the record, we check things like: amount of RAM,
failed DIMMs (via IPMI on IBM or memlog on SGI), number of
cores, number and speed of CPUs, LDAP OK, home directories
accessible, etc.

[2] - checks run prior to a job start, after a job exits
      and every 7.5 minutes (every 10 mom intervals).

- -- 
 Christopher Samuel - Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computational Initiative
 Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
         http://www.vlsci.unimelb.edu.au/

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk0GoYcACgkQO2KABBYQAh9w1gCgh19IOhXa5BWOmC3+qyZaDDr/
MrYAn1at4YwaaNkmmZpNAVNHBF0OIH0V
=/gDC
-----END PGP SIGNATURE-----