Subject: Re: RAIDFrame and NetBSD/sparc booting issues
To: Greg Oster <oster@cs.usask.ca>
From: Greg A. Woods <woods@weird.com>
List: port-sparc
Date: 08/12/2003 13:13:06
[ On Tuesday, August 12, 2003 at 08:55:03 (-0600), Greg Oster wrote: ]
> Subject: Re: RAIDFrame and NetBSD/sparc booting issues 
>
> How does it identify "components on other disks"?  Simply by serial 
> number?  That still requires putting a label in each component, which 
> is something I'd like to move away from.

What do you mean by "label" in the above?

Don't you need some kind of label, with a serial number in it, to
identify each component should they move about in the hardware
attachment namespace?

As soon as you divorce the component labels from the disk slices which
are the actual components, and instead try to keep all the
configuration information for a RAID set in just one additional slice
on each physical device where the component slices live, then I think
you end up with RAIDframe having to know how slices are assigned to
physical devices so that it can know which components to record in
each label slice.  As-is, IIUC, RAIDframe just sees component devices
as chunks of disk of a certain size, and for all it really cares all
the components of any RAID set could be on the same physical volume.
However, as-is with one component label in each component slice, the
separation of components onto separate physical devices means that
those components can move about in the device namespace, and so long
as they can all (or enough of them can) be found, the RAID set can be
re-configured and used.  (I haven't looked at exactly how RAIDframe
scans for components during its auto-detect and auto-config procedure,
but I assume it just iterates across all disks and looks for all
"RAID" slices in their partition tables, and that it doesn't really
learn how the component names it used initially are related to the
physical disks.)
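
Something like this trivial sketch is what I imagine the scan amounts
to.  To be clear, this is purely my own illustration, with made-up
types and helper names -- it's not the actual rf_* autoconfiguration
code, which I haven't read:

  #include <stdio.h>
  #include <stdint.h>

  #define FS_RAID 13    /* stand-in value; the real constant is in <sys/disklabel.h> */
  #define MAXPART 8

  struct part { int fstype; uint32_t size; };
  struct disk { const char *name; struct part parts[MAXPART]; };

  /* hypothetical stand-in for pulling the serial number out of a
   * component label stored within the given slice */
  static uint32_t
  read_component_serial(const struct disk *d, int p)
  {
      (void)d; (void)p;
      return 42;        /* pretend every component belongs to set 42 */
  }

  int
  main(void)
  {
      struct disk disks[] = {
          { "sd0", { { FS_RAID, 1000000 } } },
          { "sd1", { { FS_RAID, 1000000 } } },
      };

      /* iterate across all disks, keying only on the partition type --
       * no assumptions about device names or numbering are needed */
      for (size_t i = 0; i < sizeof disks / sizeof disks[0]; i++)
          for (int p = 0; p < MAXPART; p++)
              if (disks[i].parts[p].fstype == FS_RAID)
                  printf("%s%c: candidate component, serial %u\n",
                      disks[i].name, 'a' + p,
                      (unsigned)read_component_serial(&disks[i], p));
      return 0;
  }

The point being that the scan never has to know which physical disk a
component lives on -- it just collects every RAID-typed slice it can
find and matches up serial numbers afterwards.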

I would suggest that you really don't ever want to have to make any
assumptions about device node file naming conventions and their
relationships to physical devices; and you really shouldn't even try to
make any assumptions about device major & minor number allocation and
their relation to physical devices.  I think keeping things as they are
with one component label per component really is the safest possible way
of doing things in a platform and hardware independent manner.

> This still makes it hard to "merge" an existing FFS (or whatever) 
> partition into a RAID 1 set.

I don't think it would be hard at all, and that's the very least of my
worries.  In any kind of serious system it's downright trivial to do,
since that last filesystem won't be the only other storage on the
whole system (just copy the data over to some other storage, re-size
the partitions, re-newfs the mirrored filesystem via the RAID-1
partition, then copy the data back).

Even in the case of something like a router or gateway box it's still
trivial to do if you're doing it during the initial install, since
that "last" partition will usually be for data and/or applications
that aren't yet present, and thus there should be room on the root
filesystem to squirrel away a copy of whatever directory structure the
install laid down.

Sysinst could also be taught to always leave a spare track at the end
of the disk....

There are several other ways to do this kind of thing too, by
manipulating the existing filesystem.  In theory, only if the last
partition were too full to give up the 64 sectors (or whatever) for
the component label would you be stuck with no way to turn an existing
system into a RAID-1 mirrored system.  A tool that could do this would
even allow for converting a root disk into one with one RAID-1 per
filesystem.
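
To make the arithmetic concrete, here's the check such a tool would
make.  The 64-sector figure and all the names here are just my own
assumptions for illustration:

  #include <stdio.h>
  #include <stdint.h>

  #define LABEL_RESERVE 64    /* assumed: sectors donated at the end for the label */

  /*
   * Can this slice give up its last LABEL_RESERVE sectors to a
   * component label without moving any existing data?
   */
  static int
  can_convert(uint32_t slice_sectors, uint32_t fs_last_used_sector)
  {
      return fs_last_used_sector < slice_sectors - LABEL_RESERVE;
  }

  int
  main(void)
  {
      uint32_t slice = 8388608;          /* a 4GB slice of 512-byte sectors */
      uint32_t last_used = slice - 2048; /* the FS stops 1MB short of the end */

      printf("convertible: %s\n", can_convert(slice, last_used) ? "yes" : "no");
      return 0;
  }

If the filesystem's last allocated block already extends into those
final sectors then you'd have to shrink the filesystem first, or copy
the data off as described above.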


I'm assuming of course that one would use only one RAID-1 logical
volume per pair of disks.  I haven't seen any convincing argument for
using multiple logical volumes, one per filesystem, and yet I see
several theoretical and at least two major practical reasons to keep
things simple and use only one LV in this scenario.

Of course it's very nice to have the ability to use partitions
(instead of whole physical devices) as RAID components, and of course
it makes sense as well, but for the specific purpose of increasing the
availability of a system in the face of "minor" hardware failures it
doesn't make sense to go hog-wild with such a feature and create
separate logical RAID-1 partitions for every slice of a pair of disks
-- I'm perfectly happy with imposing a restriction like this on folks
(such as myself) who want to make it as easy as possible to create
RAID-1 volumes for existing systems and to do so in a completely
platform-independent manner.  The sheer elegance and simplicity of it
far outweigh any argument I can imagine against doing it this way.
Remember we're only talking about the one pair of disks used to
initially boot a system.  Keeping things simple and transparent in
this situation is of paramount importance.

Indeed, wouldn't this idea of moving the component label to the end of
the partition allow the whole-disk partitions, i.e. /dev/sd0c
(/dev/wd0d for those i386 guys), to be used as the components?  Then
even the boot blocks would all be mirrored transparently, even if
they're re-installed via /dev/raid0 (which would of course be the
natural and most obvious way to re-install the boot block on a RAID-1
root disk).
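
The reason this works: with the label at the end and the data starting
at offset zero, a RAID-1 write to logical sector N lands on physical
sector N of every member disk -- an identity mapping, as this little
sketch shows.  The zero offset is of course the assumption of this
proposal; as-is RAIDframe reserves sectors at the front of each
component instead, which is exactly what shifts the filesystem and
breaks this transparency:

  #include <stdio.h>
  #include <stdint.h>

  #define DATA_OFFSET 0       /* assumed: no front-of-component reserve */

  /* where does RAID-1 logical sector N land on each member disk? */
  static uint32_t
  raid1_member_sector(uint32_t logical)
  {
      return logical + DATA_OFFSET;
  }

  int
  main(void)
  {
      /* sector 0 of /dev/raid0 is sector 0 of each member disk, so a
       * boot block written through the RAID device is mirrored in
       * place, right where the firmware expects to read it */
      printf("raid0 sector 0 -> member sector %u\n",
          (unsigned)raid1_member_sector(0));
      return 0;
  }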

The more I think about this the better I like it -- it's just so
completely transparent to all the operations one does with disks that
it eliminates all the confusion and hassle we've seen to date (e.g.
with several people independently finding it necessary to write and
publish HOWTO and FAQ documents).

We wouldn't even need the hack of having the boot blocks pretend that
RAID slices are equivalent (except for the hidden FS offset) to 4.2BSD
slices so that they can find and load a second-stage boot program.

I think we could even automate the handling of the reconstruct should
one component be booted and used alone.  An /etc/rc.d script could
check whether the root filesystem is a RAID-1 or not, and if not then
it could look at the disklabel of the root disk; if it finds a
whole-disk RAID component then it could look for a similar label on
another device and try to mark that other component as failed.  This
process could be configured with explicit knowledge of which device
pairs make up the RAID-1 volume, to make it safer.

It could be made more reliable still with the addition of a tri-state
flag, instead of just a "good" or "failed" state for each component,
where one component could claim superiority even if the other were
still "good".  That way, even if the other component cannot be
explicitly marked as "failed", the surviving one can at least be
marked as having been used more recently.  If both end up in the
"superior" state then they're reset to "good" and assumed to be
consistent (which would only hose the system if somehow you booted
each component disk independently into multi-user mode in succession,
but each time the other was not accessible and could not be marked as
"failed").
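
Here's a trivial sketch of how that tri-state resolution might work at
configuration time.  The names and the exact rules are all my own
invention -- nothing like this exists in RAIDframe today:

  #include <stdio.h>

  /* an assumed tri-state per-component flag, replacing plain good/failed */
  enum comp_state { GOOD, FAILED, SUPERIOR };

  /*
   * Decide what to do with a two-component RAID-1 set at configuration
   * time, based on the states recorded in the two component labels.
   */
  static const char *
  resolve(enum comp_state a, enum comp_state b)
  {
      if (a == FAILED && b == FAILED)
          return "no usable component: refuse to configure";
      if (a == SUPERIOR && b == SUPERIOR)
          return "both claim superiority: reset both to GOOD, assume consistent";
      if (a == SUPERIOR)
          return "A was used more recently: rebuild B from A";
      if (b == SUPERIOR)
          return "B was used more recently: rebuild A from B";
      if (a == FAILED)
          return "A is failed: rebuild A from B";
      if (b == FAILED)
          return "B is failed: rebuild B from A";
      return "both GOOD: configure normally";
  }

  int
  main(void)
  {
      printf("%s\n", resolve(SUPERIOR, GOOD));
      printf("%s\n", resolve(SUPERIOR, SUPERIOR));
      return 0;
  }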

> If one has to put a component label at the beginning or end of a 
> partition, it's much harder to do the above.  (esp. with live disks 
> that contain existing data...)

At the end of a partition is always the better place for "additional"
information like a software RAID-1 component label.  Putting this kind
of information at the beginning of the partition jiggles everything
else that expects to be at the beginning of the partition.
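
And the end-of-partition location is trivially computable from the
partition size alone, so nothing living at the front of the partition
ever needs to know the label is there.  Again assuming some fixed
reserve like 64 sectors:

  #include <stdio.h>
  #include <stdint.h>

  #define LABEL_SECTORS 64    /* assumed fixed reserve, as above */

  /* the label always occupies the last LABEL_SECTORS sectors */
  static uint32_t
  label_offset(uint32_t partition_sectors)
  {
      return partition_sectors - LABEL_SECTORS;
  }

  int
  main(void)
  {
      printf("label starts at sector %u of a 2097152-sector slice\n",
          (unsigned)label_offset(2097152));
      return 0;
  }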

-- 
						Greg A. Woods

+1 416 218-0098                  VE3TCP            RoboHack <woods@robohack.ca>
Planix, Inc. <woods@planix.com>          Secrets of the Weird <woods@weird.com>