SMART: Usage Within Big Clusters
In the past I have mentioned the SMART (Self-Monitoring, Analysis, and Reporting Technology)
system included in virtually all modern hard drives. SMART-capable
hard drives have added intelligence in the firmware to monitor
the drive and to attempt to detect impending failures. SMART-capable
drives can also perform various types of self-tests, which are very
useful for diagnostics, and can monitor the temperature of the hard
drive (note: not all hard drives report the same information). There
is a nice package for Linux, called smartmontools, that allows you
to access the SMART information and to run self-tests on SMART-capable
drives to help detect drives that are failing.
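As a rough illustration of reading SMART data from a script, the sketch below calls `smartctl -A` (the attribute listing from smartmontools) and pulls out the drive temperature. It assumes the drive reports an attribute named `Temperature_Celsius` with the temperature as the first number in the raw-value column; as noted above, not all drives report the same information, so treat the attribute name and the helper names (`parse_temperature`, `read_temperature`) as assumptions rather than a universal recipe.

```python
import subprocess
from typing import Optional


def parse_temperature(smartctl_attr_output: str) -> Optional[int]:
    """Return the temperature in Celsius found in `smartctl -A` text, or None."""
    for line in smartctl_attr_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID followed by the attribute name.
        # Columns: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if (
            len(fields) >= 10
            and fields[0].isdigit()
            and fields[1] == "Temperature_Celsius"  # assumed attribute name
        ):
            try:
                # Some drives append extra text after the number; only the
                # first token of the raw value is used here.
                return int(fields[9])
            except ValueError:
                return None
    return None


def read_temperature(device: str) -> Optional[int]:
    """Run `smartctl -A` on a device (usually needs root) and parse the result."""
    # smartctl uses its exit status as a bitmask, so a nonzero status is not
    # treated as a hard error here.
    result = subprocess.run(
        ["smartctl", "-A", device],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_temperature(result.stdout)
```

Calling `read_temperature("/dev/hda")` on a node would return an integer or `None` if the drive does not expose that attribute.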
On February 14, 2004, Konstantin Kudin asked if anyone was using
SMART monitoring of IDE drives in big clusters. He was curious how often
SMART was able to give some kind of warning of a failing drive within
24 hours of failure. Steve Timm responded that they had been using SMART
monitoring tools on their cluster and SMART was able to predict failure
about 50% of the time. Steve seemed very happy with this number.
Joe Mack posted a question about how one can get information out of
smartd (the daemon in smartmontools). Steve Timm replied
that they were using an older version that didn't have smartd and
just used a cron script to run a short test every night and
capture the output to a file. Steve also said that they were probably
going to switch over to using smartd and an agent that is already
grep-ing through /var/log/messages to capture the SMART
information.
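The nightly cron approach Steve describes could look something like the sketch below. This is not his actual script; it is a minimal version under a few assumptions: the device list and log directory are supplied by the caller, and because `smartctl -t short` only starts a test and returns immediately, each run first records the health status and self-test log (which holds the previous night's result) and then kicks off a new short test. The function names and file-naming scheme are hypothetical.

```python
import datetime
import subprocess
from pathlib import Path
from typing import Callable, List, Optional, Sequence


def run_command(argv: Sequence[str]) -> str:
    """Run a command and return its standard output as text."""
    result = subprocess.run(list(argv), capture_output=True, text=True, check=False)
    return result.stdout


def nightly_check(
    devices: Sequence[str],
    log_dir: str,
    runner: Callable[[Sequence[str]], str] = run_command,
    today: Optional[datetime.date] = None,
) -> List[Path]:
    """Record SMART status for each device, then start a new short self-test."""
    out_dir = Path(log_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stamp = (today or datetime.date.today()).isoformat()
    written = []
    for device in devices:
        # Health verdict plus the self-test log from earlier runs.
        report = runner(["smartctl", "-H", "-l", "selftest", device])
        # Start tonight's short test; its result appears in the next report.
        started = runner(["smartctl", "-t", "short", device])
        out_path = out_dir / f"{Path(device).name}-{stamp}.log"
        out_path.write_text(report + "\n" + started)
        written.append(out_path)
    return written
```

A cron entry would then invoke a small wrapper that calls `nightly_check(["/dev/hda"], "/some/log/dir")` as root; the `runner` parameter exists only so the logic can be exercised without real drives.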
Felix Rauch posted that he was using smartmontools as well and
had some trouble grep-ing through the system logs, particularly when
the logs rotate. He now uses a simple setuid-root program to monitor
temperatures on the drives. Daniel Fernandez also mentioned that
it's possible to have smartd write to a file other than the
system logs and check it regularly for temperature. He also mentioned
that you can have smartd run a script if a problem develops.
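To make the last point concrete: smartd can be told to run a program when it detects a problem (the `-M exec` directive in `smartd.conf`), and it passes details to that program in environment variables such as `SMARTD_DEVICE`, `SMARTD_FAILTYPE`, and `SMARTD_MESSAGE`. The sketch below is one possible handler that appends a one-line record to a dedicated file, which sidesteps grep-ing rotated system logs. The alert file path and the function names are hypothetical, and the exact set of variables should be checked against the smartd man page for your version.

```python
import os
import sys
import time
from typing import Mapping

ALERT_FILE = "/var/log/smart-alerts.log"  # hypothetical location


def format_alert(env: Mapping[str, str], now: float) -> str:
    """Build a one-line alert record from smartd's environment variables."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(now))  # UTC
    device = env.get("SMARTD_DEVICE", "unknown-device")
    failtype = env.get("SMARTD_FAILTYPE", "unknown")
    message = env.get("SMARTD_MESSAGE", "")
    return f"{stamp} {device} {failtype} {message}".rstrip()


def append_alert(path: str, line: str) -> None:
    """Append a single alert line to the given file."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(line + "\n")


# Only act when smartd has actually supplied a device in the environment.
if __name__ == "__main__" and "SMARTD_DEVICE" in os.environ:
    append_alert(ALERT_FILE, format_alert(os.environ, time.time()))
    sys.exit(0)
```

A monitoring agent can then watch that one file, or the handler could be extended to send the record somewhere central.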
This article was originally published in ClusterWorld Magazine. It has been
updated and formatted for the web. If you want to read more about HPC
clusters and Linux you may wish to visit
Linux Magazine.
Jeff Layton has been a cluster enthusiast since 1997 and spends far
too much time reading mailing lists. He can be found hanging around the Monkey
Tree at ClusterMonkey.net (don't stick your arms through the bars though).