Current-Users archive


Re: Thoughts on large disk reliability and raid maintenance?



FYI, the Apple Xserve RAID boxes have some form of data scrubbing too.

On 11-Sep-08, at 1:13 PM, David Maxwell wrote:
Please be sure to read the following paper before doing any work on
this. It can do more harm than good. See section 4.2

http://pages.cs.wisc.edu/~krioukov/Krioukov-ParityLost.pdf

I think there are a couple of issues here, though I'm just going on my gut feeling from a first read of that paper. The main point, though, is that data scrubbing need not do more harm than good, so long as the potentially harmful step in data scrubbing (which I identify below) is not implemented.
The first thing that it was very good to be reminded of was that  
traditional RAID does not guarantee protection from data corruption  
for anything more than latent (and detectable) sector errors and  
(detectable) drive/controller failures.  For RAID to provide adequate  
protection the hardware must report a read error whenever a disk  
sector read would otherwise return incorrect data (or no data at all).
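To make that concrete, here is a rough sketch (plain C of my own, not anything
from RAIDframe) of why that error reporting matters: single parity can
regenerate exactly one block per stripe, and only when the hardware has
already told us which disk that block lives on.

#include <stddef.h>
#include <stdint.h>

#define NDATA   4       /* hypothetical 4 data disks + 1 parity disk */
#define BLKSZ   512

/*
 * Rebuild the block from the disk that reported a read error by XORing
 * the parity with the surviving data blocks.  The result is correct only
 * if "failed" really is the (one) bad disk -- parity corrects an
 * identified erasure, not silent corruption.
 */
void
rebuild_block(const uint8_t data[NDATA][BLKSZ], const uint8_t parity[BLKSZ],
    int failed, uint8_t out[BLKSZ])
{
    for (size_t i = 0; i < BLKSZ; i++) {
        uint8_t x = parity[i];

        for (int d = 0; d < NDATA; d++)
            if (d != failed)
                x ^= data[d][i];
        out[i] = x;
    }
}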
I'm not so sure though that I agree with their basic analysis that  
scrubbing might increase the risk of data corruption, at least within  
the threat model that RAID is intended to protect from, i.e. that the  
hardware will indeed always report an error whenever a disk sector  
read would otherwise return incorrect data.  I think disk scrubbing  
will only _increase_ the risk of loss if the other types of errors are  
considered, _and_ if the scrubbing step I identify below is included.
So, it also seems to me that the method of scrubbing described in the  
paper is flawed.  I think it goes one step too far.
If I understand correctly the authors say as much in the following  
sentence near the end of section 4.3:
	"Any protection technique that does not co-operate with RAID, allows  
parity recalculation to use bad data, causing irreversible data loss."
and it's probably not just irreversible, but undetectable as well (at  
least within the storage subsystem itself).
I think the problem with the scrubbing scheme they describe is in this  
last step:
	"The scrub also re-computes the parity of the data blocks and  
compares it with the parity stored on the parity disk, thereby  
detecting any inconsistencies[3]."
In reference [3], also by the same primary author, this procedure is  
further detailed as:
	"If the parity does not match the verified data, the scrub process  
fixes the parity by regenerating it from the data blocks."
(They don't seem to give any reference as to where the scrubbing  
algorithm containing this step comes from.)
Personally I would not want my RAID system to do that -- by definition  
this situation should indicate an inconsistency which the storage  
system cannot correct on its own.  An error should be reported to the  
user and the data should be restored from backups.
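To sketch what I mean (in plain C, with hypothetical read_block() and
log_error() helpers standing in for whatever a real implementation would
use -- this is not RAIDframe's scrub), a scrub pass can still do all of the
useful work, i.e. force every sector to be read so latent errors surface
while the redundancy to fix them still exists, without ever silently
rewriting the parity:

#include <stdint.h>
#include <string.h>

#define NDATA   4
#define BLKSZ   512

/* Hypothetical helpers; the names are illustrative only. */
extern int  read_block(int disk, uint64_t stripe, uint8_t buf[BLKSZ]);
extern void log_error(const char *what, uint64_t stripe);

int
scrub_stripe(uint64_t stripe)
{
    uint8_t data[NDATA][BLKSZ], parity[BLKSZ], computed[BLKSZ];

    /* Re-read every block so latent sector errors are detected now,
     * while the rest of the stripe is still intact to rebuild from. */
    for (int d = 0; d < NDATA; d++)
        if (read_block(d, stripe, data[d]) != 0) {
            log_error("latent sector error on data disk", stripe);
            return 1;   /* reconstructable from parity */
        }
    /* disk NDATA is the parity disk in this toy layout */
    if (read_block(NDATA, stripe, parity) != 0) {
        log_error("latent sector error on parity disk", stripe);
        return 1;       /* rewritable from the data blocks */
    }

    /* Verify the parity, but only report a mismatch -- we cannot tell
     * which disk is wrong, so rewriting the parity here would just make
     * the stripe consistent with possibly-bad data. */
    memset(computed, 0, BLKSZ);
    for (int d = 0; d < NDATA; d++)
        for (size_t i = 0; i < BLKSZ; i++)
            computed[i] ^= data[d][i];
    if (memcmp(computed, parity, BLKSZ) != 0) {
        log_error("parity inconsistency (NOT rewriting parity)", stripe);
        return -1;
    }
    return 0;
}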
In that second paper they go on to describe the state being "fixed" as  
"parity inconsistency" and they go on to describe it as follows:
	"+ Parity inconsistencies (PIs):  This corruption class refers to a  
mismatch between the parity computed from data blocks and the parity  
stored on disk despite the individual checksums being valid.  This  
error could be caused by lost or misdirected writes, in-memory  
corruption, processor miscalculations, and software bugs.  Parity  
inconsistencies are detected only during data scrubs."
Indeed none of these errors can be reported by the disk or controller  
when the data is read, and thus none of these errors can be reliably  
corrected by simply re-computing and re-writing the parity!  I would  
claim that without external means to verify the integrity of these  
data blocks it is impossible to know whether or not the new parity  
block is being computed from valid data.  The on-disk ECC/checksums  
(and perhaps even block checksums) are not sufficient for this  
purpose.  These kinds of errors are outside the scope of RAID  
protection and foiling the RAID system by re-computing the parity  
block does not fix anything.
They even seem to say as much in that second paper, though at the same  
time they gloss over the issue I raise:
        5.2 Parity Inconsistencies

        These errors are detected by data scrubbing.  In the absence
        of a second parity disk, one cannot identify which
        disk is at fault.  Therefore, in order to prevent potential
        data loss on disk failure, the system fixes the inconsistency
        by rewriting parity. This scenario provides further
        motivation for double-parity protection schemes.

        [[....]]

        These results assume that the parity disk is at fault.  We
        believe that counting the number of incorrect parity disks
        reflect the actual number of error disks since:  (i) entire
        shelves of disks are typically of the same age and same
        model, (ii) the incidence of these inconsistencies is quite
        low; hence, it is unlikely that multiple different disks in
        the same RAID group would be at fault.

Their own analysis of other types of errors suggests to me that it is not actually safe to assume that the parity disk is at fault. A detectable latent sector error in the parity disk would have been caught by the initial steps in the scrubbing. The probability of any undetected error must therefore be distributed evenly across all the disks involved, making it unsafe to re-compute and re-write the parity block.
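To put a rough (and admittedly hand-wavy) number on that: in a hypothetical 7+1 RAID-5 set, an undetected lost or misdirected write is about as likely to have landed on any one of the eight disks, so the "parity disk is at fault" assumption is right only about one time in eight; in the other seven cases rewriting the parity merely makes the stripe self-consistent around the bad data block, after which nothing in the storage subsystem can ever notice the corruption again.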
They finally say this in section 4.7:

        + Parity pollution:  We believe that any parity-based
        system that re-uses existing data to compute parity is
        potentially susceptible to data loss due to disk errors, in
        particular lost and misdirected writes.  In the absence
        of techniques to perfectly verify the integrity of existing
        disk blocks used for re-computing the parity, disk scrubbing
        and partial-stripe writes can cause parity pollution,
        where the parity no longer reflects valid data.

So, (and here's where my gut feeling comes in) w.r.t. implementing some form of scrubbing in RAIDframe, my guess is that so long as this "PI" fixing step discussed above is _not_ included, scrubbing will still improve the reliability of RAIDframe storage systems (especially in those cases where all the data is not regularly re-read in such a way that user reads would detect latent sector errors before multiple errors could accumulate and cause data loss). I.e. RAIDframe with disk scrubbing (but no PI fixing) should be no more susceptible to data loss or corruption than it would be without scrubbing, and it should be more robust to latent sector errors developing in parallel on multiple disks while data sits idle.
In light of the probability of the other kinds of errors these authors  
discuss it would probably also be good if some form of read-verify  
protection could be included in RAIDframe too.
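Something as simple as the following would be a starting point (again just a
sketch in C with made-up write_block()/read_block() names, not a proposal for
actual RAIDframe interfaces), and it is only meaningful if the read-back
really hits the media rather than the drive's write cache:

#include <stdint.h>
#include <string.h>

#define BLKSZ   512

/* Hypothetical helpers; the names are illustrative only. */
extern int read_block(int disk, uint64_t blkno, uint8_t buf[BLKSZ]);
extern int write_block(int disk, uint64_t blkno, const uint8_t buf[BLKSZ]);

/*
 * Write-with-verify: write the block, read it back, and fail loudly if
 * what comes back differs from what was written.
 */
int
write_verified(int disk, uint64_t blkno, const uint8_t buf[BLKSZ])
{
    uint8_t check[BLKSZ];

    if (write_block(disk, blkno, buf) != 0)
        return -1;
    if (read_block(disk, blkno, check) != 0)
        return -1;
    return memcmp(buf, check, BLKSZ) == 0 ? 0 : -1;
}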
It might also be interesting if some form of block integrity checksum  
could be included too, i.e. a checksum stored in an additional sector  
allocated at the end of every block (stripe).  (This also raises the  
issue of whether or not NetBSD+RAIDframe could make use of, say, 520- 
byte sector drives such that the checksum could be appended to the  
block and not require a whole additional sector, i.e. as described in  
2.2.1 of ref[3].)
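Roughly what I'm picturing is something like the following (illustrative C
only -- the 520-byte layout and the checksum function here are assumptions of
mine, not anything RAIDframe or the paper specifies):

#include <stddef.h>
#include <stdint.h>

#define DATA_BYTES      512
#define SECTOR_BYTES    520     /* 512 data + 8 bytes of integrity metadata */

struct sector520 {
    uint8_t  data[DATA_BYTES];
    uint64_t csum;              /* checksum written together with the data */
};

/* Simple Fletcher-style sum; a real design would more likely use a CRC. */
static uint64_t
block_csum(const uint8_t *p, size_t len)
{
    uint64_t a = 1, b = 0;

    for (size_t i = 0; i < len; i++) {
        a += p[i];
        b += a;
    }
    return (b << 32) | (a & 0xffffffff);
}

/* On read: 0 if the data matches its stored checksum, -1 if it is suspect. */
int
verify_sector(const struct sector520 *s)
{
    return block_csum(s->data, DATA_BYTES) == s->csum ? 0 : -1;
}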
The partial stripe-write issue, which seems to be independent of  
scrubbing, might also need addressing in some way.  My naive  
understanding of their state diagrams and RAID parity algorithms seems  
to suggest that only one form of partial stripe write re-computation  
of the parity can cause data loss, but I'm not sure if that method can  
be avoided.
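For reference, and as a summary of my own rather than anything taken from
RAIDframe or the paper, the two usual small-write parity updates differ in
which existing on-disk blocks they re-use:

#include <stddef.h>
#include <stdint.h>

#define BLKSZ   512

/*
 * "Subtractive" / read-modify-write update:
 *      P_new = P_old ^ D_old ^ D_new
 * re-uses the old data block and the old parity block read back from disk.
 */
void
parity_rmw(const uint8_t p_old[BLKSZ], const uint8_t d_old[BLKSZ],
    const uint8_t d_new[BLKSZ], uint8_t p_new[BLKSZ])
{
    for (size_t i = 0; i < BLKSZ; i++)
        p_new[i] = p_old[i] ^ d_old[i] ^ d_new[i];
}

/*
 * "Reconstruct" write: recompute the parity from scratch, substituting the
 * new block; this re-uses the other data blocks in the stripe instead of
 * the old data and parity.
 */
void
parity_reconstruct(const uint8_t *others[], int nothers,
    const uint8_t d_new[BLKSZ], uint8_t p_new[BLKSZ])
{
    for (size_t i = 0; i < BLKSZ; i++) {
        uint8_t x = d_new[i];

        for (int d = 0; d < nothers; d++)
            x ^= others[d][i];
        p_new[i] = x;
    }
}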
Of course data scrubbing for the various forms of RAID mirroring (i.e.  
those with no parity) would seem to always be a win-win feature.
--
                                        Greg A. Woods; Planix, Inc.
                                        <woods%planix.ca@localhost>



