Subject: Re: Bad sectors vs RAIDframe
To: None <tls@rek.tjls.com>
From: Greg Oster <oster@cs.usask.ca>
List: tech-kern
Date: 06/08/2005 11:22:57
Thor Lancelot Simon writes:
> On Wed, Jun 08, 2005 at 05:40:57PM +0100, Stephen Borrill wrote:
> > On Mon, 6 Jun 2005, Thor Lancelot Simon wrote:
> > >We got a bad run of Samsung Spinpoint drives that we unfortunately
> > >installed in NetBSD Foundation servers about a year ago.  I have had
> > >to recover several of them (all in 2-way RAIDframe mirrors) by using
> > >dd to copy the data from the corresponding sectors on one drive over
> > >the bad sectors of the other, often doing this in both directions to
> > >recover from multi-drive failures within a set.  Since then, RAIDframe
> > >has been changed so that it retries on disk error before failing a
> > >component, 

Oh... "not quite".  RAIDframe will not retry on a disk error where 
the RAID set is operating in redundant ("normal") mode.  In the 
redundant case, if the underlying device indicates an error, the 
component is immediately marked as failed.
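
Schematically, the current policy amounts to something like this (a 
toy sketch with made-up names, not the actual RAIDframe code): 

	#include <stdio.h>

	/*
	 * Toy sketch of the policy described above; none of these
	 * names come from the actual RAIDframe source.
	 */
	struct component {
		const char *name;
		int failed;
	};

	static void
	handle_io_error(struct component *c, int set_is_redundant)
	{
		if (set_is_redundant) {
			/* Redundant ("normal") mode: no retry at the
			 * RAIDframe level; the component is marked as
			 * failed immediately. */
			c->failed = 1;
			printf("%s: marked failed\n", c->name);
		} else {
			/* Non-redundant mode: the component is never
			 * failed; the error just goes back up to the
			 * caller. */
			printf("%s: error returned, component kept\n",
			    c->name);
		}
	}

	int
	main(void)
	{
		struct component c0 = { "component0", 0 };

		handle_io_error(&c0, 1);	/* redundant set */
		handle_io_error(&c0, 0);	/* non-redundant set */
		return 0;
	}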

> > >and never fails components from non-redundant sets -- so a
> > >newer kernel may let you get somewhere with data recovery, too.
> > 
> > I'm guessing these changes (or at least the second half) are:
> > 
> > http://mail-index.netbsd.org/source-changes/2004/01/02/0069.html
> 
> I don't think that's all of it -- it looks too early.  Greg?

That is correct.  The bits needed are these (in part): 

http://mail-index.netbsd.org/source-changes/2005/04/06/0081.html

> I think the changes are in the 2.0 branch _now_ but I don't think
> they were in it when 2.0 was built and released.

Right.
 
> > With 1.6.2, a read error causes component failure. As the read is not 
> > retried 

The read will not be retried at the RAIDframe level.  It may be 
retried in a lower-level driver. 

> > a successfully ECC corrected sector will not be spotted. If you 
> > spot this in time, initiating a rewrite will generally be OK, as 
> > upon a write failure the drive will map in a new sector. RAIDframe 
> > will happily fail all components in an array and then panic. In 
> > this respect, having a RAIDframe RAID 1 mirrored set is actually 
> > significantly worse than having a single disc (if you fail to spot 
> > failures quickly).
> 
> That's true for 1.6.2, at least.  Actually, with the change to never
> fail a component of a non-redundant set due to disk error, simply
> telling the set to rebuild will issue the sector writes necessary
> to fix the problem -- unless the rebuild fails because it can only
> move the data one way, and it can't read some of it from the "from"
> component (whichever read errored last); working by hand you can do
> the right thing, which is immediately upon read error dd the data
> from the _other_ half of the mirror back to the half that errored.

Well... in the hopes that what you are dd'ing over is the same as 
what should be on the half you are overwriting... There's no 
guarantee that it is if any writes have occurred in the time between 
when the first disk failed and when the read error on the other disk 
occurred... 
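
For the archives, the by-hand copy Thor describes boils down to 
something like the following (device paths and the sector number are 
placeholders only, the components should be quiescent first, and mind 
the caveat above about intervening writes): 

	#include <sys/types.h>
	#include <err.h>
	#include <fcntl.h>
	#include <unistd.h>

	/*
	 * Read one 512-byte sector from the good half of the mirror
	 * and write it over the same offset on the half that got the
	 * read error; the write should make the drive remap the bad
	 * sector.
	 */
	int
	main(void)
	{
		const char *good = "/dev/rwd0e";	/* example path */
		const char *bad = "/dev/rwd1e";		/* example path */
		off_t sector = 123456;			/* example sector */
		char buf[512];
		int gfd, bfd;

		if ((gfd = open(good, O_RDONLY)) == -1)
			err(1, "%s", good);
		if ((bfd = open(bad, O_WRONLY)) == -1)
			err(1, "%s", bad);

		if (pread(gfd, buf, sizeof(buf), sector * 512) !=
		    sizeof(buf))
			err(1, "read %s", good);
		if (pwrite(bfd, buf, sizeof(buf), sector * 512) !=
		    sizeof(buf))
			err(1, "write %s", bad);

		close(gfd);
		close(bfd);
		return 0;
	}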

> RAIDframe could clearly automatically DTRT in almost every case
> like this -- "regenerate the data from parity and write-back" is
> the same as "read from other half of mirror and write-back" but
> it's hard to see exactly how to make it do so. 

One would need to keep track of which stripe units have failed on 
which components, and then make sure that only "good" ones are used.
One could, in theory, divide a component into "zones", and only fail a 
"zone" instead of the entire component.  But that's just shuffling 
deck chairs in the case of a disk that's really starting to go 
south...
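
If anyone wanted to play with that, the bookkeeping might look 
something like this (a made-up sketch, nothing like the real 
RAIDframe data structures): 

	#include <stdbool.h>
	#include <stdio.h>

	/*
	 * Per-zone failure tracking: fail only the zone containing a
	 * bad stripe unit, and check the map before trusting a copy.
	 * Names and sizes are invented for illustration.
	 */
	#define ZONES_PER_COMPONENT 64

	struct component_map {
		bool zone_failed[ZONES_PER_COMPONENT];
	};

	/* Mark only the zone containing the failed stripe unit. */
	static void
	fail_zone(struct component_map *m, unsigned su,
	    unsigned su_per_zone)
	{
		m->zone_failed[su / su_per_zone] = true;
	}

	/* Before issuing a read, check that this copy is still good. */
	static bool
	zone_ok(const struct component_map *m, unsigned su,
	    unsigned su_per_zone)
	{
		return !m->zone_failed[su / su_per_zone];
	}

	int
	main(void)
	{
		struct component_map m = { { false } };

		fail_zone(&m, 130, 16);	/* error in stripe unit 130 */
		printf("su 130 usable: %d\n", zone_ok(&m, 130, 16));
		printf("su 10 usable: %d\n", zone_ok(&m, 10, 16));
		return 0;
	}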

Given sufficiently flakey hardware, even RAIDframe doesn't help much... 

> The internals of RAIDframe scare me.

"Welcome to my world...." :)

Later...

Greg Oster