Subject: Re: kern/30674: RAIDframe should be able to create volumes without parity rewrite
To: None <kern-bug-people@netbsd.org, gnats-admin@netbsd.org,>
From: Greg Oster <oster@cs.usask.ca>
List: netbsd-bugs
Date: 07/06/2005 15:20:02
The following reply was made to PR kern/30674; it has been noted by GNATS.

From: Greg Oster <oster@cs.usask.ca>
To: gnats-bugs@netbsd.org, Matthias Scheler <tron@colwyn.zhadum.de>
Cc: 
Subject: Re: kern/30674: RAIDframe should be able to create volumes without parity rewrite 
Date: Wed, 06 Jul 2005 09:19:19 -0600

 Matthias Scheler writes:
 > >Number:         30674
 > >Category:       kern
 > >Synopsis:       RAIDframe should be able to create volumes without parity rewrite
 > >Confidential:   no
 > >Severity:       non-critical
 > >Priority:       medium
 > >Responsible:    kern-bug-people
 > >State:          open
 > >Class:          change-request
 > >Submitter-Id:   net
 > >Arrival-Date:   Wed Jul 06 09:52:00 +0000 2005
 > >Originator:     Matthias Scheler
 > >Release:        NetBSD 3.99.7
 > >Organization:
 > Matthias Scheler                                  http://scheler.de/~matthias/
 > >Environment:
 > System: NetBSD lyssa.zhadum.de 3.99.7 NetBSD 3.99.7 (LYSSA) #0: Mon Jul 4 10:16:28 BST 2005 tron@lyssa.zhadum.de:/src/sys/compile/LYSSA i386
 > Architecture: i386
 > Machine: i386
 > >Description:
 > Setting up a RAIDframe volume requires an initial parity rewrite which
 > can take a long time. This is completely pointless because the volume
 > doesn't contain any data yet.
 
 Let's address the RAID 1 case first:
 If you're just going to build an FFS on it, then you can get away with 
 marking the parity as "good" because data will never be read until 
 after it has been written.  Fine.  If the machine crashes or 
 otherwise goes down without marking the parity as "good", then you are
 back to square one -- you *HAVE* to do the parity rebuild at that 
 point, since you have no guarantee that there were no writes in 
 progress, or that for any given sector the primary and the mirror 
 are in sync.  So the only thing you've saved is the initial rebuild 
 (and there's nothing saying you can't do that initial rebuild in the 
 background sometime after you're using the partition).
 
 There is, however, also a violation of the Principle of Least Astonishment.
 If, for example, the components had random data on them before the 
 RAID 1 set was created, and one runs "dd if=/dev/rraid0d | md5" twice
 with the parity marked as "good" (but not actually synced!), then the
 two runs might well yield different results.  One certainly does not expect a 
 "disk device" to return different data on subsequent reads!  (RAIDframe 
 will pick either the master or the mirror to read from -- in cases where 
 data is already written, this won't be a problem.  In cases where data
 has not been written to that sector, but we are still claiming that 
 the parity is good, it will violate the PoLA.)
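
 The scenario above can be sketched with a toy model.  This is only an
 illustration in Python, not RAIDframe code: the class Raid1Model, the
 helpers device_md5() and stale_set(), the sector sizes, and the fill
 patterns are all made up for the example.  The "which" argument stands in
 for whichever component the driver happens to pick for a given read; the
 real selection policy is not modelled here.

```python
import hashlib

SECTORS = 4
SECTOR_SIZE = 8


class Raid1Model:
    """Toy two-way mirror (hypothetical model, not RAIDframe code)."""

    def __init__(self, primary, mirror):
        self.components = [list(primary), list(mirror)]

    def write(self, sector, data):
        # A write always goes to both components.
        for component in self.components:
            component[sector] = data

    def read(self, sector, which):
        # `which` (0 = primary, 1 = mirror) stands in for whatever
        # component the driver happens to pick for this read.
        return self.components[which][sector]


def device_md5(raid, picks):
    """Hash the whole device, reading sector i from component picks[i]."""
    digest = hashlib.md5()
    for sector, which in enumerate(picks):
        digest.update(raid.read(sector, which))
    return digest.hexdigest()


def stale_set():
    # The two components carry different leftover contents, and nothing
    # has synced them, even though the set would be labelled "good".
    primary = [bytes([0xAA]) * SECTOR_SIZE for _ in range(SECTORS)]
    mirror = [bytes([0x55]) * SECTOR_SIZE for _ in range(SECTORS)]
    return Raid1Model(primary, mirror)


if __name__ == "__main__":
    raid = stale_set()
    first = device_md5(raid, [0, 1, 0, 1])
    second = device_md5(raid, [1, 0, 1, 0])
    print("unsynced, digests equal:", first == second)

    # Once every sector has been written, the read source no longer matters.
    for sector in range(SECTORS):
        raid.write(sector, bytes([sector]) * SECTOR_SIZE)
    first = device_md5(raid, [0, 1, 0, 1])
    second = device_md5(raid, [1, 0, 1, 0])
    print("fully written, digests equal:", first == second)
```

 In this model, only sectors that have never been written can disagree
 between reads, which matches the distinction drawn above.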
 
 Let's now look at the RAID 5 case: Consider a stripe made up of 
 component blocks A, B, C, D, and E.  Let A be the block being updated, 
 and E be the parity for the stripe.  Suppose E is not the XOR of A, B, C, 
 and D, which will generally be the case if the parity rewrite is not done.
 To do a write of A, the old contents of A will be read, the current 
 contents of E will be read, a new E will be computed, and the new A 
 and new E will be written.  If the component holding A then fails, there 
 is now no way of reconstructing the contents of A, since B, C, and D were 
 never in sync with E, and thus are useless in recomputing A.  For a 
 RAID 5 case, one *MUST* rebuild the parity before live data is put on 
 the RAID set, as otherwise there will be no way of reconstructing 
 data in the event of a component failure.
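
 The RAID 5 argument can be checked with a few lines of XOR arithmetic.
 The following is a minimal sketch in Python, not RAIDframe code: the
 function names (xor_blocks, small_write, reconstruct, rewrite_parity) and
 the block contents are invented for the illustration, and a stripe is
 modelled as a list of four data blocks plus a separate parity block.

```python
def xor_blocks(*blocks):
    """Byte-wise XOR of equally sized blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


def small_write(stripe, parity, index, new_data):
    """Read-modify-write of one data block.

    Reads the old data and old parity, computes
    new parity = old parity XOR old data XOR new data,
    stores the new data, and returns the new parity.
    """
    old_data = stripe[index]
    new_parity = xor_blocks(parity, old_data, new_data)
    stripe[index] = new_data
    return new_parity


def reconstruct(stripe, parity, failed):
    """Rebuild the block at `failed` from the parity and the survivors."""
    survivors = [b for i, b in enumerate(stripe) if i != failed]
    return xor_blocks(parity, *survivors)


def rewrite_parity(stripe):
    """What the initial parity rewrite does for one stripe."""
    return xor_blocks(*stripe)


if __name__ == "__main__":
    # Blocks A, B, C, D with leftover contents, and a parity block E that
    # was never computed from them.
    stripe = [b"\x11" * 8, b"\x22" * 8, b"\x33" * 8, b"\x44" * 8]
    stale_parity = b"\x00" * 8
    new_a = b"new data"

    # Case 1: write A without ever rewriting the parity.
    s1 = list(stripe)
    p1 = small_write(s1, stale_parity, 0, new_a)
    print("stale parity, A recoverable:", reconstruct(s1, p1, 0) == new_a)

    # Case 2: rewrite the parity first, then do the same write.
    s2 = list(stripe)
    p2 = small_write(s2, rewrite_parity(s2), 0, new_a)
    print("synced parity, A recoverable:", reconstruct(s2, p2, 0) == new_a)
```

 The read-modify-write only carries the old parity's error forward: if E
 was off from the XOR of A, B, C, and D by some value before the write, the
 new E is off by the same value afterwards, so the reconstructed A is wrong
 by exactly that amount.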
 
 I've heard the argument a couple of times, but I don't see it buying 
 anything other than removing one parity rebuild...
 
 Further comments?  As you can guess, I'm not seeing any real advantage to 
 creating volumes without parity rewrites, even for RAID 1 sets.
 
 Later...
 
 Greg Oster