Subject: Re: Bad sectors vs RAIDframe
To: Thor Lancelot Simon <firstname.lastname@example.org>
From: Stephen Borrill <email@example.com>
Date: 06/08/2005 17:40:57
On Mon, 6 Jun 2005, Thor Lancelot Simon wrote:
> We got a bad run of Samsung Spinpoint drives that we unfortunately
> installed in NetBSD Foundation servers about a year ago. I have had
> to recover several of them (all in 2-way RAIDframe mirrors) by using
> dd to copy the data from the corresponding sectors on one drive over
> the bad sectors of the other, often doing this in both directions to
> recover from multi-drive failures within a set. Since then, RAIDframe
> has been changed so that it retries on disk error before failing a
> component, and never fails components from non-redundant sets -- so a
> newer kernel may let you get somewhere with data recovery, too.
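The sector-copy recovery described in the quote might look roughly like the sketch below. It is demonstrated on plain throwaway files; on a real system the if=/of= arguments would be the raw component devices (e.g. /dev/rwd0d and /dev/rwd1d) and BAD_LBA the sector number reported in the kernel error messages. All names and numbers here are examples, not the exact commands used.

```shell
# Stand-ins for the two mirror components (raw devices on a real box):
GOOD=/tmp/wd0.img BROKEN=/tmp/wd1.img
dd if=/dev/zero of="$GOOD"   bs=512 count=16 2>/dev/null
dd if=/dev/zero of="$BROKEN" bs=512 count=16 2>/dev/null
# Put recognisable data in sector 4 of the "good" component:
printf 'gooddata' | dd of="$GOOD" bs=512 seek=4 conv=notrunc 2>/dev/null
BAD_LBA=4 COUNT=1
# Copy the corresponding sector(s) from the good component over the
# bad area; conv=notrunc stops dd truncating a file target.
dd if="$GOOD" of="$BROKEN" bs=512 seek="$BAD_LBA" skip="$BAD_LBA" \
   count="$COUNT" conv=notrunc 2>/dev/null
```

Repeating this in the other direction (swapping the roles of the two components for a different bad region) is what recovers from the multi-drive case mentioned above.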
I'm guessing these changes (or at least the second half) are:
In which case, they are in 2.0, but not 1.6 and this gives me another
avenue of recovery.
I've also successfully used the "64-sector offset as ffs disklabel
partition" trick to recover (most) data.
This is happening far too often for me now, though (I've just had another
two-disc failure, this time with 250GB drives). These are all 1.6.2
machines, so it would be great to come up with a plan to minimise future
problems, especially as they are at (generally clueless) customer sites.
My understanding is:
With 1.6.2, a read error causes immediate component failure. Because the
read is not retried, a sector that ECC could have corrected on a retry
is never given the chance. If you spot the failure in time, initiating a
rewrite will generally succeed, as the drive maps in a new sector on a
write failure. RAIDframe will happily fail every component in an array
and then panic. In this respect, having a RAIDframe RAID 1 mirrored set
is actually significantly worse than having a single disc (if you fail
to spot failures quickly).
Doing a dd of the components to /dev/null will mitigate the problem by
allowing read errors to be spotted without killing the component.
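Such a scan could be scripted roughly as below. The device list is an assumption (on a real box it would be the raw component devices); the loop is demonstrated here on a throwaway file so it can be run anywhere.

```shell
# Stand-in for a component device; use e.g. /dev/rwd0d /dev/rwd1d for real:
dd if=/dev/zero of=/tmp/comp0.img bs=512 count=64 2>/dev/null
ERRS=0
for dev in /tmp/comp0.img; do
    # Read the whole component, discarding the data; dd exits non-zero
    # on a hard read error, which is exactly what we want to catch.
    if ! dd if="$dev" of=/dev/null bs=64k 2>/dev/null; then
        echo "read errors on $dev"
        ERRS=1
    fi
done
```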
With 2.0 and later, a read error will be retried, giving error
correction a chance to work. Also, the final component of an array will
never be failed, so even with multiple read errors you'll be able to
recover some data (just as you would from a plain ffs).
My plan is to bring forward the migration from 1.6.2 to 2.0.2 and to
run a nightly check that emails both the customer and ourselves if it
spots any failed components.
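The nightly check could amount to little more than grepping `raidctl -s` output. A minimal sketch follows; the sample status text and the mail addresses are assumptions (check the exact `raidctl -s` format on your version before deploying anything like this):

```shell
# Succeeds (exit 0) if any component line in the status reports 'failed'.
raid_failed() {
    echo "$1" | grep -q 'failed'
}
# On a real machine, roughly:
#   STATUS=$(raidctl -s raid0)
#   raid_failed "$STATUS" && echo "$STATUS" | \
#       mail -s "RAID failure on $(hostname)" admin@example.net
# Assumed-format sample for demonstration:
SAMPLE='Components:
           /dev/wd0a: optimal
           /dev/wd1a: failed'
raid_failed "$SAMPLE" && echo "failure detected"
```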