Subject: Re: Why my life is sucking. Part 2.
To: Greg Oster <email@example.com>
From: Thor Lancelot Simon <firstname.lastname@example.org>
Date: 01/18/2001 19:20:42
On Thu, Jan 18, 2001 at 03:00:30PM -0600, Greg Oster wrote:
> Bill Sommerfeld writes:
> > > Wouldn't it be better to first check the overall status of the array?
> > > And once the array's parity has been correctly written, you can free
> > > the memory used to hold this bitmap. It means that you're doing two
> > > checks, not just one, while you're actually doing the on-demand
> > > re-writing of the parity; but when you're not fixing parity, it ought
> > > to save you memory, and probably time, too, when you think about
> > > keeping that whole bitmap in the CPU's cache...
> > >
> > >     if(array_is_dirty)
> > >         if(this_block_is_dirty)
> > >             rewrite_parity();
> > if there's already a function pointer at the right place in the I/O
> > path, you can do the check with zero overhead -- you start off with it
> > pointing to the "dirty, slow" path and once parity is cleaned up
> > re-point it to the "clean, fast" path.
> Yup... I haven't had time to look, but I suspect it can be found if one looks
> hard enough :)
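Something along these lines is all it would take -- a minimal sketch, with
all of the names below made up for illustration rather than taken from
RAIDframe's actual I/O path:

    struct raid_softc;          /* per-array state, incl. the dirty bitmap */
    struct buf;                 /* an I/O request */

    static int  stripe_is_dirty(struct raid_softc *, struct buf *);
    static void rewrite_parity(struct raid_softc *, struct buf *);
    static void raid_io_clean(struct raid_softc *, struct buf *);
    static void raid_io_dirty(struct raid_softc *, struct buf *);

    /* The I/O path calls through this pointer; start on the slow path. */
    static void (*raid_do_io)(struct raid_softc *, struct buf *) =
        raid_io_dirty;

    static void
    raid_io_dirty(struct raid_softc *sc, struct buf *bp)
    {
            if (stripe_is_dirty(sc, bp))    /* per-stripe bitmap check */
                    rewrite_parity(sc, bp); /* fix parity before service */
            raid_io_clean(sc, bp);          /* then the normal path */
    }

    /* Called once by the rebuild thread when every stripe is clean;
     * from then on the dirty check costs nothing at all. */
    static void
    parity_rebuild_done(void)
    {
            raid_do_io = raid_io_clean;
    }

Once parity_rebuild_done() fires, the bitmap can be freed as well, which
takes care of the memory concern above.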
> A few other things about "parity rewrite on demand".
> 1) if a block is to be read, then the associated stripe must have its parity
> updated before the block is returned. (If it is not, and the component that
> block lives on dies, then that block could be reconstructed incorrectly.)
> 2) if a block is to be written, then the associated stripe must have its
> parity updated before the block is written. (same reason as above)
> 3) there could *still* be other stripes where the parity is incorrect, and
> where a failed component would result in incorrect data being reconstructed.
> While 1) and 2) help in getting the parity correct, allowing other 'normal'
> IO increases the amount of time through which 3) poses a major problem.
I strongly believe the above analysis to be incorrect. 3) holds whether
the machine is doing a parity rebuild only or doing "real" I/O at the
same time -- "real" I/O, so long as it's always preceded by a parity
update where required, does not increase one's risk of
encountering a fatal failure except inasmuch as it may lengthen the time
window of exposure to it. *HOWEVER*,
* If you're not doing very much "real" I/O, you won't move the heads around
much and thereby interfere with the I/O being generated by the parity
rebuild thread. So, if you aren't doing much "real" I/O, you won't
actually hurt the parity rebuild time much at all.
* If you *are* doing a lot of "real" I/O, you're causing the parity for
every stripe you touch to be synchronously rebuilt, so though you change
the *ordering* of the rebuild, you don't make it much slower.
And, *so long as you update parity every time you write a stripe for any
reason*, you never actually increase the chance of having a fatal failure.
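In code, the invariant from 1) and 2) in Greg's list amounts to the
following; again the names are made up (reusing the hypothetical helpers
from the sketch above), not the actual RAIDframe entry points:

    static void mark_stripe_clean(struct raid_softc *, struct buf *);
    static void issue_io(struct raid_softc *, struct buf *);

    static void
    raid_strategy(struct raid_softc *sc, struct buf *bp)
    {
            if (stripe_is_dirty(sc, bp)) {
                    /*
                     * Parity for this stripe is suspect: a read could be
                     * reconstructed from bad parity if a component dies,
                     * and a read-modify-write parity update would just
                     * propagate the error.  Recompute parity from the
                     * full stripe, synchronously, before going on.
                     */
                    rewrite_parity(sc, bp);
                    mark_stripe_clean(sc, bp);
            }
            issue_io(sc, bp);       /* reads and writes now safe */
    }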
I repeat: this issue is well-understood; this is how every dedicated
hardware RAID controller I've encountered does it; we should do it the
same way.
Another thing to consider is to be smarter about the parity updates: try to
keep the queue at a constant depth by topping it up with parity update I/O,
and use the last known head position (that is, the request that was last in
the queue when the number of outstanding requests fell below some threshold
value) to decide where to start generating parity
update I/Os for your online rebuild. You don't *have* to rebuild the
disk in order from one end to the other, and not trying to do so will likely
yield radically better rebuild performance in the presence of other I/O.
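A minimal sketch of that scheduler, with made-up names and an arbitrary
tunable for the target queue depth:

    #include <sys/types.h>              /* daddr_t */

    #define QUEUE_TARGET        8       /* desired outstanding I/Os */
    #define NO_DIRTY_STRIPES    ((daddr_t)-1)

    static int      queue_depth(struct raid_softc *);
    static daddr_t  last_head_position(struct raid_softc *);
    static daddr_t  next_dirty_stripe(struct raid_softc *, daddr_t);
    static void     queue_parity_update(struct raid_softc *, daddr_t);

    static void
    rebuild_tick(struct raid_softc *sc)
    {
            daddr_t stripe;

            /* Top the queue back up to QUEUE_TARGET with parity updates. */
            while (queue_depth(sc) < QUEUE_TARGET) {
                    /*
                     * Start from wherever the heads last were, not from
                     * stripe 0: the rebuild need not run in order, it
                     * only has to cover every dirty stripe once.
                     */
                    stripe = next_dirty_stripe(sc, last_head_position(sc));
                    if (stripe == NO_DIRTY_STRIPES)
                            return;     /* parity is fully clean */
                    queue_parity_update(sc, stripe);
            }
    }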