Subject: Re: NetBSD iSCSI HOWTOs
To: None <>
From: Miles Nordin <carton@Ivy.NET>
List: current-users
Date: 03/01/2006 20:35:08
Content-Type: text/plain; charset=US-ASCII

>>>>> "rp" == Ray Phillips <> writes:

    rp> I'm afraid I'm one of those.  Could you elaborate on [the
    rp> RAID5 write hole]

I think this is where I first heard of it:

as I understand it, the RAID5 write hole exists because RAID5 can't
update anything smaller than a stripe consistently.  If you write a
single sector, RAID5 has to read-modify-write within the
~64kByte*(ndisks-1) (+- a few factors of 2) stripe in which the
sector resides.  For one sector, you read the old data sector and the
old parity sector (or else all the other data disks, whichever is
cheaper), then write one data disk and one parity disk.
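For concreteness, here's a toy sketch (mine, not from any real
implementation) of the XOR parity math behind that read-modify-write:

```python
# Toy sketch of RAID5's XOR parity update during a small write.
# Disk layout and sector sizes here are invented for illustration.

def xor(*sectors: bytes) -> bytes:
    """Bytewise XOR of equal-length sectors."""
    out = bytearray(len(sectors[0]))
    for s in sectors:
        for i, b in enumerate(s):
            out[i] ^= b
    return bytes(out)

# Three data disks and one parity disk, one 4-byte "sector" each.
data = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\x05\x06\x07\x08"]
parity = xor(*data)

# Overwrite the sector on disk 1: read old data + old parity,
# compute new parity, write both (two reads, two writes).
new = b"\xaa\xbb\xcc\xdd"
parity = xor(parity, data[1], new)   # new_parity = old_parity ^ old ^ new
data[1] = new

assert parity == xor(*data)          # parity still covers every data sector
```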

The ``write hole'' is during the Write phase of the read-modify-write.
If this write is interrupted, parity no longer matches the data, so
the whole stripe is effectively corrupt, including sectors to which
the layers above the RAID weren't writing: those sectors can no longer
be reconstructed correctly if a disk later fails.  Filesystems that
expect to write things smaller than a stripe will get messed up by
this.  This is the reason you need to have ``hardware RAID'' with an
NVRAM, so that there's a separate OS inside your RAID card that you
don't have to worry about crashing and corrupting a stripe, and so
entire stripes can be written to NVRAM and protected from
AC-cord-pulling before the disks are asked to begin writing them.
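To see how bystander sectors get hurt, here's a toy demonstration (again
invented, not any real code path): the data write lands, the matching
parity write is lost, and a later rebuild of a *different* disk silently
returns garbage:

```python
# Toy demonstration of the write hole: data written, parity write lost,
# then a rebuild of an untouched disk from the stale parity goes wrong.

def xor(*sectors: bytes) -> bytes:
    out = bytearray(len(sectors[0]))
    for s in sectors:
        for i, b in enumerate(s):
            out[i] ^= b
    return bytes(out)

disks = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor(*disks)

disks[0] = b"XXXX"   # data write to disk 0 completes...
                     # ...crash: the matching parity write never happens

lost = disks[2]      # later, disk 2 fails outright
rebuilt = xor(disks[0], disks[1], parity)   # reconstruct from survivors
assert rebuilt != lost   # disk 2's sector, never written to, comes back wrong
```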

If something goes wrong during the write hole, you've lost the whole
~0.25MB stripe.  And you can't get out of it by ``rebuilding'' the set
at reboot time unless I'm missing something---the best you can do in a
rebuild is unambiguously mark the stripe bad so that reading it
returns an I/O error to the layer above RAID.

Without an NVRAM, the options I can think of are pretty dismal:

 * make a ``log'' area: write stripes to the log, atomically commit
   them to the log, then copy from the log to the data area of the
   array:
   1. write whole stripe to log
   2. [write barrier]
   3. mark step#1 log block valid with a single-disk-sector write
   4. [wait long enough that the disk's cache will certainly be
       synchronized without asking/blocking/waiting]
   5. copy from log to actual stripe

   so, there is only one [write barrier].  On reboot, you skip to step
   5, treating as empty any log blocks not marked by step #3.
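   The five numbered steps might look something like this; the paths,
   the record layout, and using fsync() as a stand-in for the [write
   barrier] are all my own assumptions:

```python
# Sketch of the single-barrier log commit.  Everything here (file-backed
# "devices", CRC validity mark, fsync-as-barrier) is invented for illustration.
import os, struct, time, zlib

SECTOR = 512  # the validity mark fits one sector so the drive writes it atomically

def commit_stripe(log_path: str, data_path: str, offset: int, stripe: bytes) -> None:
    with open(log_path, "wb") as log:
        log.write(stripe)                       # 1. whole stripe to the log
        log.flush()
        os.fsync(log.fileno())                  # 2. [write barrier]
        mark = struct.pack("<II", zlib.crc32(stripe), len(stripe))
        log.write(mark.ljust(SECTOR, b"\0"))    # 3. single-sector "valid" mark
        log.flush()
    time.sleep(0)   # 4. stand-in: really, wait long enough for the cache to drain
    with open(data_path, "r+b") as dev:         # 5. copy log -> actual stripe
        dev.seek(offset)
        dev.write(stripe)
```

   On reboot you would scan the log, treat as empty any stripe whose
   step-3 mark is missing or fails its checksum, and redo step 5 for
   the rest.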

   you could also use your log to open and close {disk,stripe} tuples,
   so each potential write hole will be done as:

   1. open disk3 in the log
   2. write stripe to disk3
   3. close disk3 in the log, and open paritydisk in the log
   4. write stripe to paritydisk
   5. close paritydisk in the log

   Then, on reboot you treat all the open tuples as ``bad disks'' for
   just that stripe, and rebuild them.
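   Reboot recovery for the open/close-tuple scheme might be sketched
   like so; the tuple representation and the parity-disk-as-last-index
   convention are invented:

```python
# Sketch of reboot recovery for the open/close-tuple log.  Any tuple left
# open marks that disk bad for that one stripe; rebuild it from the rest.

def xor(*sectors: bytes) -> bytes:
    out = bytearray(len(sectors[0]))
    for s in sectors:
        for i, b in enumerate(s):
            out[i] ^= b
    return bytes(out)

def recover(open_tuples, data, parity):
    """data: per-disk lists of stripe sectors; parity: list of parity sectors.
    open_tuples: (disk_index, stripe_no) pairs still open in the log;
    disk_index == len(data) means the parity disk."""
    for disk_no, s in open_tuples:
        if disk_no == len(data):
            parity[s] = xor(*(d[s] for d in data))         # recompute parity
        else:
            others = (d[s] for i, d in enumerate(data) if i != disk_no)
            data[disk_no][s] = xor(parity[s], *others)     # rebuild data disk

# Crash between steps 2 and 3: disk 0's new data hit the platter but its
# tuple is still open and parity is stale.  Recovery rolls disk 0 back.
data = [[b"XXXX"], [b"BBBB"]]       # b"XXXX" is the half-committed write
parity = [xor(b"AAAA", b"BBBB")]    # parity still reflects the old b"AAAA"
recover({(0, 0)}, data, parity)
assert data[0][0] == b"AAAA"        # consistent (old) contents restored
```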

   with that, you have to [synchronize cache], and _wait for the cache
   to report that it's synchronized_, after each step, 5 times.  so
   while much less data is written to the log it may end up being more
   expensive, or even impossible if targets can't unambiguously report
   that the cache actually is synchronized.  The problem is that you
   are trying to do a write barrier across two disks rather than just
   send one down to a single disk.  In the scheme above we avoided
   that by step #4, just wait a while.

   There is no magic---either way is equivalent to just using an area
   of the disk as your NVRAM.  That area gets hit so hard for
   write-heavy jobs that it needs to not be on a disk.

   anyway I'm just making this up and haven't looked at any actual
   implementations, so I may have gotten something wrong, but at least
   you have some idea of the obstacles to RAID5 without NVRAM.

 * combine filesystem and RAID layers like Sun did.

   And I think also like FreeBSD did with graid3 (they increase the
   sector size geom(4) presents to UFS, so that UFS sees sectors that
   are full stripes.)  The FreeBSD way sounds like it means the RAID
   cannot give you any increase in seek bandwidth like traditional
   RAID5 can.  not sure about ZFS/raidZ and seek bandwidth.
