Current-Users archive


Re: Thoughts on large disk reliability and raid maintenance?

FYI, the Apple Xserve RAID boxes have some form of data scrubbing too.

On 11-Sep-08, at 1:13 PM, David Maxwell wrote:

Please be sure to read the following paper before doing any work on
this. It can do more harm than good. See section 4.2

I think there are a couple of issues here, though I'm just going on my gut feeling from a first read of that paper. The main point, though, is that data scrubbing need not do more harm than good, so long as the potentially harmful step in data scrubbing is not implemented.

The first thing the paper very usefully reminds us of is that traditional RAID does not guarantee protection from data corruption for anything more than latent (and detectable) sector errors and (detectable) drive/controller failures. For RAID to provide adequate protection the hardware must report a read error whenever a disk sector read would otherwise return incorrect data (or no data at all).
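To make that concrete, here is a toy sketch (invented names, not RAIDframe code) of why that "hardware must report the error" assumption matters: XOR parity can rebuild a block only when something else tells the array *which* block is bad.

```python
def xor_blocks(blocks):
    """XOR equal-length byte blocks together (RAID-4/5 style parity)."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

def reconstruct(stripe, parity, bad_index):
    """Rebuild the block at bad_index from the survivors plus parity.

    This is well-defined only because the disk/controller *reported*
    which block failed; XOR parity alone cannot locate a silently
    wrong block.
    """
    survivors = [b for i, b in enumerate(stripe) if i != bad_index]
    return xor_blocks(survivors + [parity])
```

If instead a read of some block returned wrong data without an error, the parity check could tell that *something* in the stripe is inconsistent, but not which of the N+1 blocks to distrust.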

I'm not so sure, though, that I agree with their basic analysis that scrubbing might increase the risk of data corruption, at least within the threat model that RAID is intended to protect against, i.e. that the hardware will indeed always report an error whenever a disk sector read would otherwise return incorrect data. I think disk scrubbing will only _increase_ the risk of loss if the other types of errors are considered, _and_ if the scrubbing step I identify below is included.

So it also seems to me that the method of scrubbing described in the paper is flawed: it goes one step too far.

If I understand correctly the authors say as much in the following sentence near the end of section 4.3:

"Any protection technique that does not co-operate with RAID, allows parity recalculation to use bad data, causing irreversible data loss."

and it's probably not just irreversible, but undetectable as well (at least within the storage subsystem itself).

I think the problem with the scrubbing scheme they describe is in this last step:

"The scrub also re-computes the parity of the data blocks and compares it with the parity stored on the parity disk, thereby detecting any inconsistencies[3]."

In reference [3], also by the same primary author, this procedure is further detailed as:

"If the parity does not match the verified data, the scrub process fixes the parity by regenerating it from the data blocks."

(They don't seem to give any reference as to where the scrubbing algorithm containing this step comes from.)

Personally I would not want my RAID system to do that -- by definition this situation should indicate an inconsistency which the storage system cannot correct on its own. An error should be reported to the user and the data should be restored from backups.

In that second paper they call the state being "fixed" a "parity inconsistency" and describe it as follows:

"+ Parity inconsistencies (PIs): This corruption class refers to a mismatch between the parity computed from data blocks and the parity stored on disk despite the individual checksums being valid. This error could be caused by lost or misdirected writes, in-memory corruption, processor miscalculations, and software bugs. Parity inconsistencies are detected only during data scrubs."

Indeed none of these errors can be reported by the disk or controller when the data is read, and thus none of these errors can be reliably corrected by simply re-computing and re-writing the parity! I would claim that without external means to verify the integrity of these data blocks it is impossible to know whether or not the new parity block is being computed from valid data. The on-disk ECC/checksums (and perhaps even block checksums) are not sufficient for this purpose. These kinds of errors are outside the scope of RAID protection and foiling the RAID system by re-computing the parity block does not fix anything.
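A toy demonstration of the point (my own invented example, not taken from the paper): once a silent corruption has happened, "fixing" the parity makes the array internally consistent again while leaving the data wrong -- and erases the only evidence that anything was amiss.

```python
def xor_blocks(blocks):
    """XOR equal-length byte blocks together."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

data = [bytearray(b'AA'), bytearray(b'BB')]
parity = xor_blocks(data)                    # a healthy, consistent stripe

data[0][0] ^= 0xFF                           # a lost/misdirected write
                                             # silently corrupts one block
mismatch_seen = xor_blocks(data) != parity   # a scrub notices the mismatch

parity = xor_blocks(data)                    # the questionable "PI fix":
                                             # recompute parity from
                                             # possibly-bad data
consistent_again = xor_blocks(data) == parity
# The stripe now verifies clean on every future scrub, but data[0] is
# still corrupt -- the loss is both irreversible and undetectable.
```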

They even seem to say as much in that second paper, though at the same time glossing over the issue I raise:

        5.2 Parity Inconsistencies

        These errors are detected by data scrubbing.  In the absence
        of a second parity disk, one cannot identify which
        disk is at fault.  Therefore, in order to prevent potential
        data loss on disk failure, the system fixes the inconsistency
        by rewriting parity. This scenario provides further
        motivation for double-parity protection schemes.


        These results assume that the parity disk is at fault.  We
        believe that counting the number of incorrect parity disks
        reflect the actual number of error disks since:  (i) entire
        shelves of disks are typically of the same age and same
        model, (ii) the incidence of these inconsistencies is quite
        low; hence, it is unlikely that multiple different disks in
        the same RAID group would be at fault.

Their own analysis of other types of errors suggests to me that it is not actually safe to assume that the parity disk is at fault. A detectable latent sector error in the parity disk would have been caught by the initial steps in the scrubbing. The probability of any undetected error must therefore be distributed evenly across all the disks involved, making it unsafe to re-compute and re-write the parity block.
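A back-of-the-envelope version of that objection (the uniform-fault assumption is mine): if an undetected error is equally likely to sit on any of the n disks in the group, then rewriting parity guesses right only 1/n of the time.

```python
n_disks = 8                          # e.g. a 7-data + 1-parity group
p_parity_at_fault = 1 / n_disks      # chance the rewrite fixes the real fault
p_overwrites_good_parity = 1 - p_parity_at_fault
# With 8 disks the "parity disk is at fault" assumption is wrong 7 times
# out of 8, and each wrong guess destroys the correct parity block.
```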

They finally say this in section 4.7:

        + Parity pollution:  We believe that any parity-based
        system that re-uses existing data to compute parity is
        potentially susceptible to data loss due to disk errors, in
        particular lost and misdirected writes.  In the absence
        of techniques to perfectly verify the integrity of existing
        disk blocks used for re-computing the parity, disk scrubbing
        and partial-stripe writes can cause parity pollution,
        where the parity no longer reflects valid data.

So (and here's where my gut feeling comes in), w.r.t. implementing some form of scrubbing in RAIDframe, my guess would be that so long as this "PI"-fixing step discussed above is _not_ included, scrubbing will still improve the reliability of RAIDframe storage systems (especially in those cases where all the data is not regularly re-read in such a way that user reads would detect latent sector errors before multiple errors could accumulate and cause data loss). I.e. RAIDframe with disk scrubbing (but no PI fixing) should be no more susceptible to data loss or corruption than it would be without, and it should be more robust against latent sector errors developing in parallel on multiple disks while data sits idle.
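A minimal sketch of what such a scrub pass might look like (interface and names invented, not RAIDframe code): repair only what the hardware *reports* as bad, and merely flag parity mismatches for the operator instead of rewriting parity.

```python
def xor_blocks(blocks):
    """XOR equal-length byte blocks together."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

def scrub_stripe(read_block, write_block, n_disks):
    """Scrub one stripe; disk n_disks-1 holds the parity block.

    read_block(disk) raises IOError on a reported latent sector error.
    """
    blocks, bad = [], []
    for disk in range(n_disks):
        try:
            blocks.append(read_block(disk))
        except IOError:                 # the disk told us WHICH block is bad
            blocks.append(None)
            bad.append(disk)
    if len(bad) == 1:
        # Safe: the error was reported, so reconstruction is well-defined.
        survivors = [b for b in blocks if b is not None]
        write_block(bad[0], xor_blocks(survivors))
        return "repaired"
    if len(bad) > 1:
        return "unrecoverable"          # more errors than parity can cover
    if xor_blocks(blocks[:-1]) != blocks[-1]:
        # Deliberately NOT "fixed": report it so the operator can restore
        # from backups, rather than polluting the parity.
        return "parity-mismatch"
    return "clean"
```

The key difference from the paper's scheme is the last branch: a mismatch with no reported error is an alarm, not a repair opportunity.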

In light of the probability of the other kinds of errors these authors discuss it would probably also be good if some form of read-verify protection could be included in RAIDframe too.

It might also be interesting if some form of block integrity checksum could be included too, i.e. a checksum stored in an additional sector allocated at the end of every block (stripe). (This also raises the issue of whether or not NetBSD+RAIDframe could make use of, say, 520-byte sector drives such that the checksum could be appended to the block and not require a whole additional sector, i.e. as described in 2.2.1 of ref[3].)
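As an illustration of the idea (the layout and the use of CRC32 are my assumptions, not the design in ref[3]), a per-stripe checksum gives the scrub exactly the independent evidence it needs to decide whether the data blocks, rather than the parity, are the suspect ones:

```python
import zlib

def seal_stripe(data_blocks):
    """Return (payload, checksum) as they would be written to disk.

    On a hypothetical 520-byte-sector drive the 4-byte CRC could live
    in the per-sector slack; here it is simply carried alongside.
    """
    payload = b"".join(data_blocks)
    return payload, zlib.crc32(payload)

def data_blocks_trustworthy(payload, stored_checksum):
    """True if the data blocks still match the checksum written with them."""
    return zlib.crc32(payload) == stored_checksum
```

With something like this in place, a parity mismatch combined with a good data checksum would actually justify rewriting parity, while a bad data checksum would justify the opposite.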

The partial-stripe-write issue, which seems to be independent of scrubbing, might also need addressing in some way. My naive understanding of their state diagrams and of RAID parity algorithms suggests that only one form of partial-stripe-write parity re-computation can cause data loss, but I'm not sure whether that method can be avoided.

Of course data scrubbing for the various RAID mirroring forms (i.e. those with no parity) would seem to always be a win-win feature.
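For completeness, a sketch of why the mirrored case is easy (interface invented): with two copies, any error the hardware reports on one side is repaired from the other, and even a silent divergence is at least detectable and reportable.

```python
def scrub_mirror(read_copy, write_copy):
    """Scrub one mirrored block.

    read_copy(side) raises IOError on a reported read error, side 0 or 1.
    """
    copies = []
    for side in (0, 1):
        try:
            copies.append(read_copy(side))
        except IOError:
            copies.append(None)
    if copies[0] is None and copies[1] is None:
        return "unrecoverable"
    for side in (0, 1):
        if copies[side] is None:
            write_copy(side, copies[1 - side])   # rebuild from the good copy
            return "repaired"
    # Both reads succeeded; report (don't guess about) silent divergence.
    return "clean" if copies[0] == copies[1] else "mismatch"
```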

                                        Greg A. Woods; Planix, Inc.

