[Beowulf] Surviving a double disk failure
csamuel at vpac.org
Sun Apr 19 04:40:52 EDT 2009
----- "Joe Landman" <landman at scalableinformatics.com> wrote:
> 2) Scrub early, scrub often.
As long as you don't have IBM gear where what appears to
be a firmware issue somewhere (possibly on the disks themselves)
can mean that the LSI RAID controller they rebadge thinks
that up to 12 drives have just failed in the space of a
Of course none of them really have failed, but your RAID60
is still toast and boy does it take a few years off your life,
not to mention days and days to recover from tape..
Happens under Debian (with mainline kernel) and CentOS
with its stock kernel (we copied over the scrub script
that Debian packages), but of course IBM wouldn't take
any notice until we could do it under RHEL - you can
trigger a scrub manually through (for example):
echo check > /sys/block/md0/md/sync_action
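
For anyone wanting to script that, here is a minimal sketch built around the same sysfs write. It assumes a Linux md array; the function names and the SYSFS_ROOT variable are hypothetical, added only so the helper can be pointed at a test directory instead of /sys. It refuses to start a check while the array is already resyncing, recovering, or scrubbing.

```shell
#!/bin/sh
# Hypothetical helper: start an md "check" scrub on one array.
# SYSFS_ROOT is an assumption for testing; on a real system it defaults to /sys.
start_scrub() {
    md_dev="$1"    # array name, e.g. md0
    md_dir="${SYSFS_ROOT:-/sys}/block/${md_dev}/md"

    # Bail out if the array does not exist or we lack permission.
    if [ ! -w "${md_dir}/sync_action" ]; then
        echo "cannot write ${md_dir}/sync_action" >&2
        return 1
    fi

    # Do not start a check on top of a resync, recovery, or running scrub.
    current=$(cat "${md_dir}/sync_action")
    if [ "${current}" != "idle" ]; then
        echo "${md_dev}: busy (${current}), not starting a check" >&2
        return 2
    fi

    # Same write as the manual command above.
    echo check > "${md_dir}/sync_action"
}

# Hypothetical helper: print the mismatch count md recorded for the array.
scrub_mismatches() {
    cat "${SYSFS_ROOT:-/sys}/block/$1/md/mismatch_cnt"
}
```

Run as root, "start_scrub md0" kicks off the check, and "scrub_mismatches md0" can be read once sync_action has gone back to idle.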
We now have another vendor's storage unit and won't
think about using the IBM unit in anger until we can
confirm that the latest round of firmware updates has
solved the problem.
Christopher Samuel - (03) 9925 4751 - Systems Manager
The Victorian Partnership for Advanced Computing
P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency
Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf