[Beowulf] Surviving a double disk failure

Joe Landman landman at scalableinformatics.com
Fri Apr 10 08:18:03 EDT 2009

Stuart Midgley wrote:

Good work Stuart!

> What are the lessons learnt? Well with software raid Linux is both your 

1) Use RAID6.  It is your friend.  RAID5 is unashamedly your enemy.

2) Scrub early, scrub often.  We cron this ~1/week on Delta-V's (sounds 
similar to your box).

3) Pay attention to any/every error.  If a disk keeps giving you errors, toss it.
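
As a rough illustration of point 3, here is a minimal Python sketch that counts md "read error corrected" messages per disk in a kernel log, so a disk that keeps needing correction stands out. The message pattern is an assumption (the exact wording differs between kernel versions), and the function name is made up for this example; adjust the regex to whatever your kernel actually logs.

```python
import re
from collections import Counter

# ASSUMPTION: md logs corrected read errors roughly as
#   "... read error corrected (8 sectors at 123456 on sdc1)"
# The exact wording varies by kernel version; adapt this pattern to your logs.
CORRECTED = re.compile(r"read error corrected \(\d+ sectors at \d+ on (\S+)\)")


def corrected_reads_by_disk(lines):
    """Count corrected-read messages per member disk (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        match = CORRECTED.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Feeding it the lines of the kernel log (for example `open("/var/log/messages")`, if that is where your distro puts them) gives a per-disk tally; any disk with a non-zero count is a candidate for replacement under the policy discussed in this thread.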

> friend and enemy. The behaviour of md got us in this mess. When md gets 
> an error on read it recovers the data from the other disks and re-writes 
> the blocks to the failed disk hoping the disk will reallocate. You do 
> get a warning saying that md encountered a recoverable error. So you 
> think it is ok. BUT the disk still failed on read and you haven't 
> swapped it out. Some time later when another disk fails hard and you get 
> a failed read on your other dodgy disk md sees 2 failed disks. And it's 
> all over.

This is why RAID6 is your friend.  Aside from this, the scrubbing mode 
of MD is a lifesaver (it would require a later kernel; bug me offline if 
you want to try one).

So are the later versions of the md tools.  The kernel, drivers, and 
tools with your distro are *ancient* by most standards.

> My advice:  don't let Linux collude with the disk vendors and reduce 

heh ...

> your reliability. Swap any disk that gets a correctable error on read.   
> Reallocation on write is fine not on read. The disk has failed.

add to this:

4) Scheduled scrubbing, specifically to detect these errors.  Turn on the 
error correction bits for the scrub to force it to try to correct errors.
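
A minimal sketch of a scheduled scrub, assuming a kernel recent enough to expose the md sysfs files `sync_action` and `mismatch_cnt` under `/sys/block/<array>/md/`. Writing `check` only reads and compares; writing `repair` also asks md to rewrite what it finds inconsistent. The function names, the `md0` array name, and the script path in the cron comment are hypothetical.

```python
from pathlib import Path

# Standard sysfs location for block devices; passed as a parameter so the
# functions can be pointed at a scratch directory for a dry run.
SYSFS_ROOT = Path("/sys/block")


def start_scrub(array, action="check", root=SYSFS_ROOT):
    """Ask md to scrub `array` (e.g. "md0"). Needs root on a real system.

    action="check"  : read everything and count inconsistencies
    action="repair" : same, but also rewrite inconsistent stripes
    """
    if action not in ("check", "repair"):
        raise ValueError(f"unsupported action: {action!r}")
    (root / array / "md" / "sync_action").write_text(action + "\n")


def mismatch_count(array, root=SYSFS_ROOT):
    """Return the mismatch count md reported for the last check/repair."""
    return int((root / array / "md" / "mismatch_cnt").read_text().strip())


# Hypothetical weekly cron entry (Sunday 03:00) calling a script that runs
# start_scrub("md0", "repair"):
#   0 3 * * 0  root  /usr/local/sbin/md_scrub.py
```

The scrub runs in the background after the write returns, so `mismatch_count` is only meaningful once `sync_action` reads `idle` again.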

Glad you were able to get your data back.


Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615