[Beowulf] Re: failure trends in a large disk drive population

Mark Hahn hahn at mcmaster.ca
Wed Feb 21 21:44:26 EST 2007

> weakly correlated with failure.  However, of all the disks that failed, less 
> than half (around 45%) had ANY of the "strong" signals and another 25% had 
> some of the "weak" signals.  This means that over a third of disks that 
> failed gave no appreciable warning.  Therefore even combining the variables 
> would give no better than a 70% chance of predicting failure.

well, a factorial analysis might still show useful interactions.
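
to make that concrete, here is a minimal sketch of the simplest 2x2 case. it assumes you have one record per disk of the form (strong_signal, weak_signal, failed), all booleans. the function names and the record layout are made up for illustration, and the difference-of-differences contrast below is only a crude stand-in for a proper factorial analysis (no significance test, only two factors).

```python
from collections import defaultdict


def failure_rate_by_combo(records):
    """records: iterable of (strong_signal, weak_signal, failed) booleans.

    Returns {(strong, weak): failure_rate} for every combination seen.
    The record layout is an assumption, not taken from the paper.
    """
    counts = defaultdict(lambda: [0, 0])  # combo -> [failures, total]
    for strong, weak, failed in records:
        cell = counts[(strong, weak)]
        cell[0] += int(failed)
        cell[1] += 1
    return {combo: fails / total for combo, (fails, total) in counts.items()}


def interaction(rates):
    """Difference of differences on the failure-rate scale.

    Zero means the two signals' effects simply add; nonzero means having
    both signals behaves differently from the sum of the separate effects.
    Requires all four (strong, weak) combinations to be present.
    """
    effect_of_weak_given_strong = rates[(True, True)] - rates[(True, False)]
    effect_of_weak_given_no_strong = rates[(False, True)] - rates[(False, False)]
    return effect_of_weak_given_strong - effect_of_weak_given_no_strong
```

a clearly nonzero interaction would mean the combination of signals carries information that neither signal carries alone, which is the case the per-variable percentages can't rule out.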

> number of disks.  For example, among the disks that failed, many had a large 
> number of seek errors; however, over 70% of disks in the fleet -- failed and 
> working -- had a large number of seek errors.

was there any trend across time in the seek errors?
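
by "trend" I mean something as simple as a per-disk slope of the seek error count over time. a rough sketch, assuming you have (time, count) samples for each disk. the sample format and the function name are hypothetical, and a plain least-squares line is only one of several reasonable ways to measure a trend.

```python
def seek_error_slope(samples):
    """Ordinary least-squares slope of seek error count against time.

    samples: list of (t, count) pairs for one disk, with at least two
    distinct t values (otherwise the slope is undefined and this raises
    ZeroDivisionError).
    """
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_c = sum(c for _, c in samples) / n
    num = sum((t - mean_t) * (c - mean_c) for t, c in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    return num / den
```

a disk whose count is high but flat and one whose count is climbing would look the same in a "large number of seek errors" tally, but not in the slope.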

> So that's our master plan.  Just don't tell anyone. :)

hah.  well, if it were me, the M.P. would involve some sort of proactive
treatment: say, a full-disk read once a day.  SMART self-tests _ought_ 
to be more valuable than that, but otoh, the vendors probably munge the 
measurements pretty badly.
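
a minimal sketch of the full-disk read, assuming a device (or file) path you are allowed to open for reading. the function name, the 1 MiB chunk size and the skip-ahead-on-error policy are my own choices for illustration. note that this goes through the page cache, so a production version would probably want direct I/O, which is not shown here.

```python
def full_read(path, chunk_size=1 << 20):
    """Read `path` sequentially from start to end.

    Returns (bytes_read, bad_offsets), where bad_offsets lists the
    starting offset of every chunk whose read raised OSError.
    """
    bytes_read = 0
    bad_offsets = []
    offset = 0
    # buffering=0 gives a raw file object, so each read() is one read call.
    with open(path, "rb", buffering=0) as dev:
        while True:
            try:
                data = dev.read(chunk_size)
            except OSError:
                # Record the failing chunk and skip past it.
                bad_offsets.append(offset)
                offset += chunk_size
                dev.seek(offset)
                continue
            if not data:  # end of device or file
                break
            bytes_read += len(data)
            offset += len(data)
    return bytes_read, bad_offsets
```

run from cron once a day against something like /dev/sdX (a placeholder, and it needs read permission on the device), a non-empty bad_offsets list is the signal to look at.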

regards, mark hahn.
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

