tech-kern: Re: Bad sectors vs RAIDframe

Subject: Re: Bad sectors vs RAIDframe
To: Thor Lancelot Simon <tls@rek.tjls.com>
From: Stephen Borrill <netbsd@precedence.co.uk>
List: tech-kern
Date: 06/08/2005 17:40:57

On Mon, 6 Jun 2005, Thor Lancelot Simon wrote:
> We got a bad run of Samsung Spinpoint drives that we unfortunately
> installed in NetBSD Foundation servers about a year ago.  I have had
> to recover several of them (all in 2-way RAIDframe mirrors) by using
> dd to copy the data from the corresponding sectors on one drive over
> the bad sectors of the other, often doing this in both directions to
> recover from multi-drive failures within a set.  Since then, RAIDframe
> has been changed so that it retries on disk error before failing a
> component, and never fails components from non-redundant sets -- so a
> newer kernel may let you get somewhere with data recovery, too.

I'm guessing these changes (or at least the second half) are:

http://mail-index.netbsd.org/source-changes/2004/01/02/0069.html

In which case they are in 2.0 but not 1.6, and this gives me another
avenue of recovery.

I've also successfully used the "64-sector offset as ffs disklabel 
partition" trick to recover (most) data.
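
For anyone unfamiliar with the trick, the arithmetic is roughly as
below. This is only a sketch: it assumes 512-byte sectors and that the
64 sectors in question sit at the very start of the component
partition. The function name and the example numbers are made up for
illustration, and the size it returns is an upper bound rather than the
exact size of the filesystem inside.

```python
# Assumptions (not stated in the post): 512-byte sectors, and the 64
# skipped sectors are at the start of the component partition.
RESERVED_SECTORS = 64
SECTOR_BYTES = 512


def inner_partition(component_offset, component_size):
    """Return (offset, size) in sectors for a partition that skips the
    first 64 sectors of a component. The size is an upper bound."""
    if component_size <= RESERVED_SECTORS:
        raise ValueError("component is too small to hold any data")
    return (component_offset + RESERVED_SECTORS,
            component_size - RESERVED_SECTORS)


# Hypothetical component starting at sector 63, 1000000 sectors long.
offset, size = inner_partition(component_offset=63, component_size=1000000)
print(offset, size)                     # 127 999936
print(RESERVED_SECTORS * SECTOR_BYTES)  # 32768 bytes skipped
```

The resulting offset and size are what would go into an extra
disklabel entry of type ffs, mounted read-only.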

This is happening far too often for me now, though (I have just had
another failure of 2 discs, this time 250GB ones). These are all 1.6.2
machines, so it would be great to come up with a plan to minimise
future problems, especially as these are at (generally clueless)
customer sites.

My understanding is:
With 1.6.2, a read error causes component failure. Because the read is
not retried, a sector that the drive's ECC could have recovered on a
second attempt never gets that chance. If you spot the failure in time,
initiating a rewrite will generally be OK, as on a write failure the
drive will map in a new sector. Otherwise RAIDframe will happily fail
all components in an array and then panic. In this respect, having a
RAIDframe RAID 1 mirrored set is actually significantly worse than
having a single disc (if you fail to spot failures quickly).

Doing a dd of the components to /dev/null will mitigate the problem by 
allowing read errors to be spotted without killing the component.
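
A rough sketch of such a scan in Python, for illustration only. The
function name, chunk size and error limit are my own choices. Unlike a
plain dd, which stops at the first error, this carries on past a failed
chunk and records its offset, giving up after a fixed number of errors
so that a dead drive cannot make it loop forever.

```python
import os


def scan_for_read_errors(path, chunk_bytes=64 * 1024, max_errors=100):
    """Read `path` from start to end, discarding the data.

    Returns (bytes_read, bad_offsets), where bad_offsets lists the byte
    offset of every chunk that raised an I/O error.
    """
    bad_offsets = []
    bytes_read = 0
    offset = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        while len(bad_offsets) < max_errors:
            try:
                data = os.pread(fd, chunk_bytes, offset)
            except OSError:
                # Note the failing chunk and skip over it.
                bad_offsets.append(offset)
                offset += chunk_bytes
                continue
            if not data:
                break  # end of device or file
            bytes_read += len(data)
            offset += len(data)
    finally:
        os.close(fd)
    return bytes_read, bad_offsets
```

It would be pointed at the underlying component device rather than at
the raid device itself, so that a read error is reported to the script
and never reaches RAIDframe.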

With 2.0 and later, a read error will be retried, giving error
correction a chance to work. In addition, the final component of an
array will never be failed, so even with multiple read errors you'll be
able to recover some data (as well as you would on a plain ffs).

My plan is to hurry forward the migration to 2.0.2 from 1.6.2 
and to run a nightly check which will email both the customer and 
ourselves if it spots any failed components.
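
The checking half of that could look something like the sketch below.
It assumes the status text (for example from `raidctl -s`) lists each
component on its own line as "<device>: <state>"; that layout and the
sample text are assumptions to be checked against the real output on
the release in question. Sending the mail is left out.

```python
import re

# Assumed line layout: leading whitespace, device path, colon, state.
COMPONENT_LINE = re.compile(r"^\s*(/dev/\S+):\s*(\S+)\s*$")


def failed_components(status_text):
    """Return the devices whose reported state is 'failed'."""
    failed = []
    for line in status_text.splitlines():
        match = COMPONENT_LINE.match(line)
        if match and match.group(2) == "failed":
            failed.append(match.group(1))
    return failed


# Made-up sample in the assumed layout, not captured from a real machine.
SAMPLE = """Components:
           /dev/wd0a: optimal
           /dev/wd1a: failed
No spares.
"""
print(failed_components(SAMPLE))  # ['/dev/wd1a']
```

A nightly job would run this over the status of each raid device and
send mail whenever the returned list is not empty.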

-- 
Stephen