Current-Users archive


Re: Thoughts on large disk reliability and raid maintenance?



FYI, the Apple Xserve RAID boxes have some form of data scrubbing too.

On 11-Sep-08, at 1:13 PM, David Maxwell wrote:
Please be sure to read the following paper before doing any work on
this. It can do more harm than good. See section 4.2

http://pages.cs.wisc.edu/~krioukov/Krioukov-ParityLost.pdf

I think there are a couple of issues here, though I'm just going on my gut feeling from a first read of that paper. The main point, though, is that data scrubbing need not do more harm than good, so long as the potentially harmful step in data scrubbing (which I identify below) is not implemented.
The first thing that it was very good to be reminded of was that  
traditional RAID does not guarantee protection from data corruption  
for anything more than latent (and detectable) sector errors and  
(detectable) drive/controller failures.  For RAID to provide adequate  
protection the hardware must report a read error whenever a disk  
sector read would otherwise return incorrect data (or no data at all).
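To make that concrete, here is a rough sketch (plain C of my own, not anything
from RAIDframe) of why that error reporting matters: single parity can
regenerate exactly one block per stripe, and only when the hardware has
already told us which disk that block lives on.

#include <stddef.h>
#include <stdint.h>

#define NDATA   4       /* hypothetical 4 data disks + 1 parity disk */
#define BLKSZ   512

/*
 * Rebuild the block from the disk that reported a read error by XORing
 * the parity with the surviving data blocks.  The result is correct only
 * if "failed" really is the (one) bad disk -- parity corrects an
 * identified erasure, not silent corruption.
 */
void
rebuild_block(const uint8_t data[NDATA][BLKSZ], const uint8_t parity[BLKSZ],
    int failed, uint8_t out[BLKSZ])
{
    for (size_t i = 0; i < BLKSZ; i++) {
        uint8_t x = parity[i];

        for (int d = 0; d < NDATA; d++)
            if (d != failed)
                x ^= data[d][i];
        out[i] = x;
    }
}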
I'm not so sure though that I agree with their basic analysis that  
scrubbing might increase the risk of data corruption, at least within  
the threat model that RAID is intended to protect from, i.e. that the  
hardware will indeed always report an error whenever a disk sector  
read would otherwise return incorrect data.  I think disk scrubbing  
will only _increase_ the risk of loss if the other types of errors are  
considered, _and_ if the scrubbing step I identify below is included.
So, it also seems to me that the method of scrubbing described in the  
paper is flawed.  I think it goes one step too far.
If I understand correctly the authors say as much in the following  
sentence near the end of section 4.3:
	"Any protection technique that does not co-operate with RAID, allows  
parity recalculation to use bad data, causing irreversible data loss."
and it's probably not just irreversible, but undetectable as well (at  
least within the storage subsystem itself).
I think the problem with the scrubbing scheme they describe is in this  
last step:
	"The scrub also re-computes the parity of the data blocks and  
compares it with the parity stored on the parity disk, thereby  
detecting any inconsistencies[3]."
In reference [3], also by the same primary author, this procedure is  
further detailed as:
	"If the parity does not match the verified data, the scrub process  
fixes the parity by regenerating it from the data blocks."
(They don't seem to give any reference as to where the scrubbing  
algorithm containing this step comes from.)
Personally I would not want my RAID system to do that -- by definition  
this situation should indicate an inconsistency which the storage  
system cannot correct on its own.  An error should be reported to the  
user and the data should be restored from backups.
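To sketch what I mean (in plain C, with hypothetical read_block() and
log_error() helpers standing in for whatever a real implementation would
use -- this is not RAIDframe's scrub), a scrub pass can still do all of the
useful work, i.e. force every sector to be read so latent errors surface
while the redundancy to fix them still exists, without ever silently
rewriting the parity:

#include <stdint.h>
#include <string.h>

#define NDATA   4
#define BLKSZ   512

/* Hypothetical helpers; the names are illustrative only. */
extern int  read_block(int disk, uint64_t stripe, uint8_t buf[BLKSZ]);
extern void log_error(const char *what, uint64_t stripe);

int
scrub_stripe(uint64_t stripe)
{
    uint8_t data[NDATA][BLKSZ], parity[BLKSZ], computed[BLKSZ];

    /* Re-read every block so latent sector errors are detected now,
     * while the rest of the stripe is still intact to rebuild from. */
    for (int d = 0; d < NDATA; d++)
        if (read_block(d, stripe, data[d]) != 0) {
            log_error("latent sector error on data disk", stripe);
            return 1;   /* reconstructable from parity */
        }
    /* disk NDATA is the parity disk in this toy layout */
    if (read_block(NDATA, stripe, parity) != 0) {
        log_error("latent sector error on parity disk", stripe);
        return 1;       /* rewritable from the data blocks */
    }

    /* Verify the parity, but only report a mismatch -- we cannot tell
     * which disk is wrong, so rewriting the parity here would just make
     * the stripe consistent with possibly-bad data. */
    memset(computed, 0, BLKSZ);
    for (int d = 0; d < NDATA; d++)
        for (size_t i = 0; i < BLKSZ; i++)
            computed[i] ^= data[d][i];
    if (memcmp(computed, parity, BLKSZ) != 0) {
        log_error("parity inconsistency (NOT rewriting parity)", stripe);
        return -1;
    }
    return 0;
}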
In that second paper they go on to describe the state being "fixed" as  
"parity inconsistency" and they go on to describe it as follows:
	"+ Parity inconsistencies (PIs):  This corruption class refers to a  
mismatch between the parity computed from data blocks and the parity  
stored on disk despite the individual checksums being valid.  This  
error could be caused by lost or misdirected writes, in-memory  
corruption, processor miscalculations, and software bugs.  Parity  
inconsistencies are detected only during data scrubs."
Indeed none of these errors can be reported by the disk or controller  
when the data is read, and thus none of these errors can be reliably  
corrected by simply re-computing and re-writing the parity!  I would  
claim that without external means to verify the integrity of these  
data blocks it is impossible to know whether or not the new parity  
block is being computed from valid data.  The on-disk ECC/checksums  
(and perhaps even block checksums) are not sufficient for this  
purpose.  These kinds of errors are outside the scope of RAID  
protection and foiling the RAID system by re-computing the parity  
block does not fix anything.
They even seem to say as much in that second paper, though at the same  
time they gloss over the issue I raise:
        5.2 Parity Inconsistencies

        These errors are detected by data scrubbing.  In the absence
        of a second parity disk, one cannot identify which
        disk is at fault.  Therefore, in order to prevent potential
        data loss on disk failure, the system fixes the inconsistency
        by rewriting parity. This scenario provides further
        motivation for double-parity protection schemes.

        [[....]]

        These results assume that the parity disk is at fault.  We
        believe that counting the number of incorrect parity disks
        reflect the actual number of error disks since:  (i) entire
        shelves of disks are typically of the same age and same
        model, (ii) the incidence of these inconsistencies is quite
        low; hence, it is unlikely that multiple different disks in
        the same RAID group would be at fault.

Their own analysis of other types of errors suggests to me that it is not actually safe to assume that the parity disk is at fault. A detectable latent sector error in the parity disk would have been caught by the initial steps in the scrubbing. The probability of any undetected error must therefore be distributed evenly across all the disks involved, making it unsafe to re-compute and re-write the parity block.
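To put a rough (and admittedly hand-wavy) number on that: in a hypothetical 7+1 RAID-5 set, an undetected lost or misdirected write is about as likely to have landed on any one of the eight disks, so the "parity disk is at fault" assumption is right only about one time in eight; in the other seven cases rewriting the parity merely makes the stripe self-consistent around the bad data block, after which nothing in the storage subsystem can ever notice the corruption again.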
They finally say this in section 4.7:

        + Parity pollution:  We believe that any parity-based
        system that re-uses existing data to compute parity is
        potentially susceptible to data loss due to disk errors, in
        particular lost and misdirected writes.  In the absence
        of techniques to perfectly verify the integrity of existing
        disk blocks used for re-computing the parity, disk scrubbing
        and partial-stripe writes can cause parity pollution,
        where the parity no longer reflects valid data.

So, (and here's where my gut feeling comes in) w.r.t. implementing some form of scrubbing in RAIDframe, my guess is that so long as this "PI" fixing step discussed above is _not_ included, scrubbing will still improve the reliability of RAIDframe storage systems (especially in those cases where all the data is not regularly re-read in such a way that user reads would detect latent sector errors before multiple errors could accumulate and cause data loss). I.e. RAIDframe with disk scrubbing (but no PI fixing) should be no more susceptible to data loss or corruption than it would be without scrubbing, and it should be more robust to latent sector errors developing in parallel on multiple disks while data sits idle.
In light of the probability of the other kinds of errors these authors  
discuss it would probably also be good if some form of read-verify  
protection could be included in RAIDframe too.
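Something as simple as the following would be a starting point (again just a
sketch in C with made-up write_block()/read_block() names, not a proposal for
actual RAIDframe interfaces), and it is only meaningful if the read-back
really hits the media rather than the drive's write cache:

#include <stdint.h>
#include <string.h>

#define BLKSZ   512

/* Hypothetical helpers; the names are illustrative only. */
extern int read_block(int disk, uint64_t blkno, uint8_t buf[BLKSZ]);
extern int write_block(int disk, uint64_t blkno, const uint8_t buf[BLKSZ]);

/*
 * Write-with-verify: write the block, read it back, and fail loudly if
 * what comes back differs from what was written.
 */
int
write_verified(int disk, uint64_t blkno, const uint8_t buf[BLKSZ])
{
    uint8_t check[BLKSZ];

    if (write_block(disk, blkno, buf) != 0)
        return -1;
    if (read_block(disk, blkno, check) != 0)
        return -1;
    return memcmp(buf, check, BLKSZ) == 0 ? 0 : -1;
}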
It might also be interesting if some form of block integrity checksum  
could be included too, i.e. a checksum stored in an additional sector  
allocated at the end of every block (stripe).  (This also raises the  
issue of whether or not NetBSD+RAIDframe could make use of, say, 520- 
byte sector drives such that the checksum could be appended to the  
block and not require a whole additional sector, i.e. as described in  
2.2.1 of ref[3].)
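Roughly what I'm picturing is something like the following (illustrative C
only -- the 520-byte layout and the checksum function here are assumptions of
mine, not anything RAIDframe or the paper specifies):

#include <stddef.h>
#include <stdint.h>

#define DATA_BYTES      512
#define SECTOR_BYTES    520     /* 512 data + 8 bytes of integrity metadata */

struct sector520 {
    uint8_t  data[DATA_BYTES];
    uint64_t csum;              /* checksum written together with the data */
};

/* Simple Fletcher-style sum; a real design would more likely use a CRC. */
static uint64_t
block_csum(const uint8_t *p, size_t len)
{
    uint64_t a = 1, b = 0;

    for (size_t i = 0; i < len; i++) {
        a += p[i];
        b += a;
    }
    return (b << 32) | (a & 0xffffffff);
}

/* On read: 0 if the data matches its stored checksum, -1 if it is suspect. */
int
verify_sector(const struct sector520 *s)
{
    return block_csum(s->data, DATA_BYTES) == s->csum ? 0 : -1;
}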
The partial stripe-write issue, which seems to be independent of  
scrubbing, might also need addressing in some way.  My naive  
understanding of their state diagrams and RAID parity algorithms seems  
to suggest that only one form of partial stripe write re-computation  
of the parity can cause data loss, but I'm not sure if that method can  
be avoided.
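For reference, and as a summary of my own rather than anything taken from
RAIDframe or the paper, the two usual small-write parity updates differ in
which existing on-disk blocks they re-use:

#include <stddef.h>
#include <stdint.h>

#define BLKSZ   512

/*
 * "Subtractive" / read-modify-write update:
 *      P_new = P_old ^ D_old ^ D_new
 * re-uses the old data block and the old parity block read back from disk.
 */
void
parity_rmw(const uint8_t p_old[BLKSZ], const uint8_t d_old[BLKSZ],
    const uint8_t d_new[BLKSZ], uint8_t p_new[BLKSZ])
{
    for (size_t i = 0; i < BLKSZ; i++)
        p_new[i] = p_old[i] ^ d_old[i] ^ d_new[i];
}

/*
 * "Reconstruct" write: recompute the parity from scratch, substituting the
 * new block; this re-uses the other data blocks in the stripe instead of
 * the old data and parity.
 */
void
parity_reconstruct(const uint8_t *others[], int nothers,
    const uint8_t d_new[BLKSZ], uint8_t p_new[BLKSZ])
{
    for (size_t i = 0; i < BLKSZ; i++) {
        uint8_t x = d_new[i];

        for (int d = 0; d < nothers; d++)
            x ^= others[d][i];
        p_new[i] = x;
    }
}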
Of course data scrubbing for the various forms of RAID mirroring (i.e.  
those with no parity) would seem to always be a win-win feature.
--
                                        Greg A. Woods; Planix, Inc.
                                        <woods%planix.ca@localhost>



