Subject: Re: Bad sectors vs RAIDframe
To: None <tls@rek.tjls.com>
From: Greg Oster <oster@cs.usask.ca>
List: tech-kern
Date: 06/08/2005 12:26:36
Thor Lancelot Simon writes:
> On Wed, Jun 08, 2005 at 11:22:57AM -0600, Greg Oster wrote:
> > Thor Lancelot Simon writes:
> > 
> > > RAIDframe could clearly automatically DTRT in almost every case
> > > like this -- "regenerate the data from parity and write-back" is
> > > the same as "read from other half of mirror and write-back" but
> > > it's hard to see exactly how to make it do so. 
> > 
> > One would need to keep track of what stripe units have failed on 
> > which components, and then make sure that only "good" ones are used.
> > One could, in theory, divide a component into "zones", and only fail a 
> > "zone" instead of the entire component.  But that's just shuffling 
> > deck chairs in the case of a disk that's really starting to go 
> > south...
> 
> I think we're talking at cross-purposes.  What I'm suggesting is that
> we _know_ that, because the interface between them and the host doesn't
> really give them any other option, IDE drives generally spare sectors
> out only when those sectors are written to -- so, if you see a read
> error on such a disk, and you have the data available to write back,
> you should.

Ahh.. ok.  

> In the RAID case, if you're still redundant, you are guaranteed that
> you do have the data available to write back.  So, if you get a single
> error reading a stripe, you still got the data you needed in order to
> issue the write that will fix the bad sector on the one disk that
> failed the read.

Right.
 
> I'm not talking about post-failure recovery -- what I'm actually talking
> about is using the RAID redundancy to _synchronously_ fix bad sectors
> on IDE disks, so that it's never necessary to fail a component, a stripe,
> a hypothetical zone, etc. at all.
> 
> How hard it might be to do this in the error-recovery path in RAIDframe,
> I can only imagine (*shudder*) but it seems to me it's clearly the right
> thing to do.  Otherwise, any read of any bad sector is ultimately going
> to lead to failure of the entire component and the need to do a rebuild.

It would be.... "doable, but ugly"... perhaps "doable, but very ugly". 

Since every I/O in RAIDframe is described by a DAG (Directed Acyclic 
Graph), it could be a (non-trivial) matter of constructing the 
appropriate DAG at the right time.  Right now, the DAG construction 
for a read from a RAID 1 set goes something like:

 1) Generate a DAG for a read from whichever component is less busy 
(or using whatever algorithm happens to be handy)
 2) If the read for 1) fails, mark the component as failed, and 
create a DAG for a read from the remaining component.
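
In rough pseudo-C (the component_t type and helpers like 
pick_less_busy() and dag_read() are invented for illustration -- 
they're not the real RAIDframe names), the current path is roughly:

#include <sys/types.h>	/* daddr_t, size_t */
#include <errno.h>	/* EIO, used further below */

typedef struct component component_t;	/* one half of the mirror */

/* stand-ins for the real DAG machinery; 0 on success, errno on failure */
component_t	*pick_less_busy(component_t *, component_t *);
component_t	*other_half(component_t *);
int		 dag_read(component_t *, daddr_t, void *, size_t);
int		 dag_write(component_t *, daddr_t, const void *, size_t);
void		 mark_component_failed(component_t *);

int
raid1_read(component_t *a, component_t *b, daddr_t blk, void *buf,
    size_t len)
{
	component_t *c;

	/* 1) read from whichever component is less busy */
	c = pick_less_busy(a, b);
	if (dag_read(c, blk, buf, len) == 0)
		return 0;

	/* 2) that read failed: fail the component... */
	mark_component_failed(c);

	/* ...and retry the read on the surviving half */
	return dag_read(other_half(c), blk, buf, len);
}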

What would need to happen is:
 1) Generate a DAG for a read from whichever component is less busy 
(or using whatever algorithm happens to be handy)
 2) If the read for 1) fails, generate a DAG that issues a read for 
the (hopefully) good component, writes the data to the failed 
component, and then returns the data.
 3) If the write in 2) fails, mark the component as failed.
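
Sticking with the same invented helpers from the sketch above, the 
new path would be roughly:

int
raid1_read_fixup(component_t *a, component_t *b, daddr_t blk, void *buf,
    size_t len)
{
	component_t *c, *good;

	/* 1) read from whichever component is less busy */
	c = pick_less_busy(a, b);
	if (dag_read(c, blk, buf, len) == 0)
		return 0;

	/* 2) read failed: get the data from the (hopefully) good half */
	good = other_half(c);
	if (dag_read(good, blk, buf, len) != 0) {
		/* both halves failed the read -- nothing left to try */
		mark_component_failed(c);
		return EIO;
	}

	/* ...then write it back so the drive can spare the bad sector */
	if (dag_write(c, blk, buf, len) != 0) {
		/* 3) only now does the component get marked failed */
		mark_component_failed(c);
	}

	/* either way, the caller gets its data */
	return 0;
}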

It's probably not quite this simple, because I don't think a DAG node 
is allowed to fail (i.e. for the write to fail) and yet have the DAG 
complete successfully.  Performing this operation within the read DAG 
might also not be the best approach -- it might be better to handle this as a 
reconstruction of just a single stripe...  But that triggers a whole 
other pile of hair (including how to deal with RAID sets where 
reconstruction is already in progress on some later part of the 
set...)  Perhaps a new "just this stripe" reconstruction routine 
would be useful here...
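
A hypothetical "just this stripe" routine (again with invented names, 
and hand-waving past the stripe locking and the interaction with an 
in-progress reconstruction) might look something like:

int
rescue_single_stripe(component_t *good, component_t *bad, daddr_t blk,
    void *scratch, size_t len)
{
	int error;

	/*
	 * A real version would take the stripe lock here and stay
	 * out of the way of any full reconstruction already sweeping
	 * this part of the set.
	 */
	error = dag_read(good, blk, scratch, len);
	if (error != 0)
		return error;	/* the "good" half failed too; punt */

	/* write the data back so the drive can spare the bad sector */
	error = dag_write(bad, blk, scratch, len);
	if (error != 0)
		mark_component_failed(bad);
	return error;
}

The read DAG could then complete as soon as it has the data from the 
good half, with the write-back queued separately rather than wedged 
into the DAG itself.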

I agree this would be a good thing... Getting it right, however, is 
going to require some thought... (Proper testing may be a challenge too :) )

Later...

Greg Oster