netbsd-users: Re: Bad sectors vs RAIDframe

Subject: Re: Bad sectors vs RAIDframe
To: Greg Oster <oster@cs.usask.ca>
From: Thor Lancelot Simon <tls@rek.tjls.com>
List: netbsd-users
Date: 06/08/2005 13:43:20

On Wed, Jun 08, 2005 at 11:22:57AM -0600, Greg Oster wrote:
> Thor Lancelot Simon writes:
> 
> > RAIDframe could clearly automatically DTRT in almost every case
> > like this -- "regenerate the data from parity and write-back" is
> > the same as "read from other half of mirror and write-back" but
> > it's hard to see exactly how to make it do so. 
> 
> One would need to keep track of what stripe units have failed on 
> which components, and then make sure that only "good" ones are used.
> One could, in theory, divide a component into "zones", and only fail a 
> "zone" instead of the entire component.  But that's just shuffling 
> deck chairs in the case of a disk that's really starting to go 
> south...

I think we're talking at cross-purposes.  What I'm suggesting is this:
we _know_ that IDE drives generally spare sectors out only when those
sectors are written to, because the interface between the drive and the
host doesn't really give them any other option.  So, if you see a read
error on such a disk, and you have the data available to write back,
you should.

In the RAID case, if you're still redundant, you are guaranteed that
you do have the data available to write back.  So, if you get a single
error reading a stripe, you still got the data you needed in order to
issue the write that will fix the bad sector on the one disk that
failed the read.

I'm not talking about post-failure recovery -- what I'm actually talking
about is using the RAID redundancy to _synchronously_ fix bad sectors
on IDE disks, so that it's never necessary to fail a component, a stripe,
a hypothetical zone, etc. at all.
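
As a rough illustration of the read-then-write-back idea, here is a toy
Python model.  It is only a sketch, not RAIDframe code: the Disk class,
ReadError, xor() and read_with_repair() are hypothetical names, and the
"a write remaps a bad sector" behaviour is an assumption built into the
model.  With a single surviving peer the XOR reduces to a plain copy,
which is the mirror case.

```python
from functools import reduce


class ReadError(Exception):
    """Raised when a simulated disk cannot read a sector."""


class Disk:
    """Toy disk: reading a bad sector fails; writing it remaps it (assumed)."""

    def __init__(self, sectors):
        self.sectors = list(sectors)
        self.bad = set()          # sector numbers that currently fail on read

    def read(self, n):
        if n in self.bad:
            raise ReadError(n)
        return self.sectors[n]

    def write(self, n, data):
        self.bad.discard(n)       # model: the drive spares the sector on write
        self.sectors[n] = data


def xor(blocks):
    """Byte-wise XOR of equal-length blocks (a single block is returned as-is)."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))


def read_with_repair(disks, idx, n):
    """Read sector n of disks[idx]; on a read error, rebuild it from the
    other components of the stripe and write it back immediately."""
    try:
        return disks[idx].read(n)
    except ReadError:
        # A second ReadError here propagates: the set is no longer redundant
        # for this stripe, so there is nothing to write back.
        others = [d.read(n) for i, d in enumerate(disks) if i != idx]
        data = xor(others)
        disks[idx].write(n, data)  # the write is what fixes the bad sector
        return data


# Two data disks plus one parity disk, one sector each.
d0 = Disk([b"\x01\x02"])
d1 = Disk([b"\x10\x20"])
parity = Disk([xor([d0.sectors[0], d1.sectors[0]])])
disks = [d0, d1, parity]

d1.bad.add(0)                      # sector 0 on d1 goes bad
recovered = read_with_repair(disks, 1, 0)
```

After the call, the read has returned the original data and the sector
on d1 is readable again, so no component had to be failed.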

How hard it might be to do this in the error-recovery path in RAIDframe,
I can only imagine (*shudder*) but it seems to me it's clearly the right
thing to do.  Otherwise, any read of any bad sector is ultimately going
to lead to failure of the entire component and the need to do a rebuild.

Thor