Subject: [Beowulf] S.M.A.R.T usage in big clusters

Mon Feb 16 12:10:43 EST 2004

> Message: 1
> Date: Sat, 14 Feb 2004 11:28:22 -0800 (PST)
> From: Konstantin Kudin <konstantin_kudin at yahoo.com>
> To: beowulf at beowulf.org
> Subject: [Beowulf] S.M.A.R.T usage in big clusters
>
>  I am curious if anyone is using SMART monitoring of
> ide drives in a big cluster.

Yes. We use smartmontools:

http://smartmontools.sourceforge.net/

Hard drive failures are by far the most common hardware failure we see on
our systems. We've hooked smartmontools into the batch queueing system we
use, so that if drives are flagged as failing, the host gets closed to new
jobs. (You could extend this to do checkpoint/migration if your code
supports it; ours doesn't.)
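
A minimal sketch of that kind of hook, in Python, under some loud
assumptions: it relies only on smartctl's documented exit status (bit 3 is
set when the SMART status check returns "DISK FAILING"), and the
close_host callback is a hypothetical stand-in for whatever command your
batch system uses to close a host to new jobs. It is not our actual
script, and the function names are invented for illustration.

```python
import subprocess

# Bit 3 of smartctl's exit status is set when the SMART status check
# returns "DISK FAILING" (see "EXIT STATUS" in the smartctl man page).
SMART_STATUS_FAILING = 1 << 3


def drive_is_failing(exit_status):
    """Return True if smartctl's exit status reports a failing drive."""
    return bool(exit_status & SMART_STATUS_FAILING)


def smartctl_exit_status(device):
    """Run `smartctl -H` on one device and return its exit status."""
    result = subprocess.run(
        ["smartctl", "-H", device],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode


def check_host(devices, close_host, get_status=smartctl_exit_status):
    """Close the host if any drive is flagged; return the flagged drives.

    close_host is a hypothetical callback: it should wrap whatever
    admin command your batch system provides for closing a host.
    """
    failing = [dev for dev in devices if drive_is_failing(get_status(dev))]
    if failing:
        close_host(failing)
    return failing
```

Run periodically on each node (from cron, say), something like
check_host(["/dev/hda"], close_host=my_close_command) would stop new jobs
landing on a node while leaving running jobs alone. Passing get_status in
as a parameter is only there so the logic can be exercised without real
drives.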

Our cluster typically runs fairly short jobs (less than 1 hour or so) so
jobs usually finish before the drive finally fails.  I haven't collected
any hard statistics on how many failures we catch before they affect a
user's work, but my gut feeling is that we catch over 80% of cases,
which is certainly enough to make it worth implementing.

Cheers,

Guy Coates

-- 
Guy Coates,  Informatics System Group
The Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1SA, UK
Tel: +44 (0)1223 834244 ex 7199

_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org