port-i386: Re: RFC: root on raidframe howto

Subject: Re: RFC: root on raidframe howto
To: NetBSD/i386 Discussion List <port-i386@NetBSD.ORG>
From: Greg A. Woods <woods@weird.com>
List: port-i386
Date: 08/07/2003 14:13:02
[ On Wednesday, August 6, 2003 at 08:11:01 (-0500), Frederick Bruckman wrote: ]
> Subject: Re: RFC: root on raidframe howto
>
> 1) You have one RAID per disk, rather than one per file system.
> There has been a lot of discussion on this question in the past, but
> I don't recall exactly what the consensus was. :-) I went with one per
> file-system, with the idea that it would be more robust against the
> case where both disks begin to fail at about the same time.

I think having one RAID-1 per filesystem can only help with a few of the
many possible failure scenarios and meanwhile it would seem to add a
bunch of unnecessary overhead (multiple RAID devices instead of one),
though I'm not sure what that overhead amounts to in practice.

In the end even if "true" bad sectors start to appear on both disks at
the same time but in different areas I'm not sure it'll really be that
much easier to recover than restoring from backups or re-installing, and
doing it with only two drive channels (e.g. typical IDE/ATA systems)
could be a lot more painful (though perhaps still not so painful as
losing critical data).

In the end I see RAIDframe RAID-1 for boot/root/swap/var only as an
availability enhancer (helping to avoid some emergency downtime for what
are hopefully the more common failure modes), and not as a solution for
making critical data easier to recover after a more catastrophic
failure.  If both disks do start to fail at the same time then
effectively the attempt to use RAID-1 was futile and un-scheduled
down-time was not avoided, regardless of how they fail or whether or not
there's some chance that alternating filesystems on alternating disks
are recoverable or not (except maybe in the very rare circumstance where
the failure mode is a one-time media error that only affects one or a
very few sectors on each disk, but how can you be sure that's what's
happened and that the problem won't suddenly go catastrophic?).
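
To make the two layouts easier to compare, here is a rough sketch in
Python of the scenario under discussion.  It is only a toy model and not
anything RAIDframe itself does: it assumes a component is marked failed
as soon as any bad sector falls inside it, and that a RAID-1 set is lost
once both of its components have failed.  The partition boundaries,
sector numbers, and the surviving_sets() helper are all made up for
illustration.

```python
def surviving_sets(layout, bad_on_disk0, bad_on_disk1):
    """Return the names of RAID-1 sets that still have a clean component.

    layout maps a set name to the (start, end) sector range it occupies
    on both disks; end is exclusive.  Toy assumption: one bad sector
    anywhere in a component fails that whole component.
    """
    alive = []
    for name, (start, end) in layout.items():
        failed0 = any(start <= s < end for s in bad_on_disk0)
        failed1 = any(start <= s < end for s in bad_on_disk1)
        if not (failed0 and failed1):
            alive.append(name)
    return alive


# Hypothetical 1000-sector disks.
one_per_disk = {"raid0": (0, 1000)}
one_per_fs = {"root": (0, 100), "swap": (100, 300), "var": (300, 1000)}

# Bad sectors in different file systems on each disk: only the
# per-file-system layout keeps every set running.
print(surviving_sets(one_per_disk, [50], [500]))   # []
print(surviving_sets(one_per_fs, [50], [500]))     # ['root', 'swap', 'var']

# Bad sectors in different places but inside the same file system:
# that set is lost in either layout.
print(surviving_sets(one_per_fs, [350], [900]))    # ['root', 'swap']
```

The per-file-system layout only wins in the first case, and even then
both disks are known to be bad and still have to be replaced.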

In any case I do have one system configured this way at a client site,
and it seems to be excessively slow when under load and paging.  I was
wondering whether having multiple RAID devices causes more contention,
with competing seeks in separate parts of the disks, than would happen
if a single contiguous virtual disk "mirrors" the underlying physical
hardware.  I don't know enough about how NetBSD
does or does not aggregate I/Os to disks to understand this fully
though.

-- 
						Greg A. Woods

+1 416 218-0098                  VE3TCP            RoboHack <woods@robohack.ca>
Planix, Inc. <woods@planix.com>          Secrets of the Weird <woods@weird.com>