Subject: Re: RAIDFrame and NetBSD/sparc booting issues
To: NetBSD/sparc Discussion List <port-sparc@NetBSD.ORG>
From: Greg Oster <oster@cs.usask.ca>
List: port-sparc
Date: 08/12/2003 16:17:48
"Greg A. Woods" writes:
> [ On Tuesday, August 12, 2003 at 08:55:03 (-0600), Greg Oster wrote: ]
> > Subject: Re: RAIDFrame and NetBSD/sparc booting issues 
> >
> > How does it identify "components on other disks"?  Simply by serial 
> > number?  That still requires putting a label in each component, which 
> > is something I'd like to move away from.
> 
> What do you mean by "label" in the above?

"component label".

> Don't you need some kind of label, with a serial number in it, to
> identify each component should they move about in the hardware
> attachment namespace?

Yes.  And that component label can live either in the component or in 
a metadata area (assuming people don't change partition labels..)
 
> As soon as you divorce the component labels from the disk slices which
> are the actual components and you try to keep all the configuration
> information for a raid set in just one additional slice on each physical
> device where the component slices live then I think you end up with
> RAIDframe having to know about how slices are assigned to physical
> devices in order that it can know which components to record in each
> label slice. 

That would be correct.  A slightly different set of deck chairs to 
shuffle :)

> As-is, IIUC, RAIDframe just sees component devices as
> chunks of disk of a certain size and for all it really cares all the
> components of any RAID set could be all on the same physical volume.

Right.

> However as-is with one component label in each component slice the
> separation of components onto separate physical devices means that those
> components can move about in the device namespace and so long as they
> can all (or enough of them can) be found then the RAID set can be
> re-configured and used.  

Right.

> (I haven't looked at exactly how RAIDframe
> scans for components during its auto-detect and auto-config procedure,
> but I assume it just iterates across all disks and looks for all "RAID"
> slices in their partition tables 

Yes.

> but that it doesn't really learn how
> the component names it used initially are related to the physical disks.)

When it's doing autoconfig, it doesn't really care what the component 
name is.  It saves that information so that it can figure out what 
component the user is talking about, but that's about it.
 
> I would suggest that you really don't ever want to have to make any
> assumptions about device node file naming conventions and their
> relationships to physical devices; and you really shouldn't even try to
> make any assumptions about device major & minor number allocation and
> their relation to physical devices.  I think keeping things as they are
> with one component label per component really is the safest possible way
> of doing things in a platform and hardware independent manner.
> 
> > This still makes it hard to "merge" an existing FFS (or whatever) 
> > partition into a RAID 1 set.
> 
> I don't think it would be hard at all and that's the very least of my
> worries.  In any kind of serious system it's downright trivial to do
> since that last filesystem won't be the only other storage on the whole
> system (just copy the data over to some other storage, re-size the
> partitions, and re-newfs the mirrored filesystem via the RAID-1
> partition, then copy the data back.)

And we can do that now! :)
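(For the record, a rough sketch of that dance, with made-up device 
names, a made-up serial number, and the usual "have backups first" 
caveat:

   cd /home && pax -rw -pe . /spare/home.copy   # stash the data
   umount /home
   disklabel -e sd0            # shrink the partition, set fstype RAID
   raidctl -C /etc/raid0.conf raid0   # force-configure the new set
   raidctl -I 2003081201 raid0        # write fresh component labels
   raidctl -iv raid0                  # initialize parity/mirror
   newfs /dev/rraid0a
   mount /dev/raid0a /home
   cd /spare/home.copy && pax -rw -pe . /home   # copy it all back

/etc/raid0.conf and the partition letters are only examples, of 
course.)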
 
> Even in the case of something like a router or gateway box it's still
> trivial to do if you're doing it during the initial install since that
> "last" partition will usually be for data and/or applications that's not
> yet present and thus there should be room on the root filesystem to
> squirrel away a copy of whatever directory structure the install laid
> down.
> 
> Sysinst could also be taught to always leave a spare track at the end of
> the disk too....
> 
> There are several other ways to do this kind of thing too by
> manipulating the existing filesystem.  In theory only if the last
> partition was too full to give up the 64 sectors (or whatever) for the
> component label are you stuck with no way to turn an existing system in
> to a RAID-1 mirrored system.  A tool that could do this would even allow
> for converting a root disk into one with one RAID-1 per filesystem.
> 
> 
> I'm assuming of course that one would use only one RAID-1 logical volume
> per pair of disks.  I haven't seen any convincing argument for using
> multiple logical volumes with one per filesystem and yet I see several
> theoretical and at least two major practical reasons to keep things
> simple and only use one LV in this scenario.

I'm sure I've given some reasons for "one filesystem per RAID set" 
before, but here are a few again:
 1) Sets with / and swap get into "full redundancy" much sooner.
 2) A failure in one part of one disk doesn't cost redundancy or 
performance in the other RAID sets. 
 3) RAIDframe throttles the number of I/Os that can go to each 
individual RAID set, so more RAID sets means more total I/O (and 
likely more head thrashing, but that's a slightly different issue :) ).
 4) In the event of "something going wonky" on a server, it's often 
possible to unmount data filesystems and unconfigure their associated 
RAID sets before attempting a reboot.  If the reboot fails (see 
"something going wonky") then you don't have to rebuild the parity on 
the RAID sets that were unconfigured.
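(To make that shape concrete: a minimal config for one of those small 
mirrored sets might look like the sketch below, where the component 
names are only examples:

   START array
   # numRow numCol numSpare
   1 2 0

   START disks
   /dev/sd0a
   /dev/sd1a

   START layout
   # sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level
   128 1 1 1

   START queue
   fifo 100

You end up with one such file, and one raid device, per filesystem.)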

> Of course it's very nice to have the ability to use partitions (instead
> of whole physical devices) as RAID components, and of course it makes
> sense as well, but for the specific purpose of increasing the
> availability of a system in face of "minor" hardware failures it doesn't
> make sense to go hog-wild with such a feature and create separate logical
> RAID-1 partitions for every slice of a pair of disks 

I guess I'm non-sensical then, 'cause on every machine I've set up 
I've used "one filesystem, one RAID set". :)

> -- I'm perfectly
> happy with imposing a restriction like this on folks (such as myself)
> who want to make it as easy as possible to create RAID-1 volumes for
> existing systems and to do so in a completely platform-independent
> manner.  The sheer elegance and simplicity of it far outweigh any
> argument I can imagine against doing it this way.  Remember we're only
> talking about the one pair of disks used to initially boot a system.
> Keeping things simple and transparent in this situation is of paramount
> importance.

Well... "scalable" and "easy to expand once you figure out what it is 
you really want to do" are also important..  I'd hate to go with the 
super-simple solution if it means that 99% of people who setup a RAID-1 
set end up having to re-do an install one month later when they figure 
out what they really want.  We already have a "simple" implementation 
of a component label.  It turns out that said implementation isn't 
compatible with the nuances of various archs. 
 
> Indeed wouldn't this idea of moving the component label to the end of
> the partition allow the whole-disk partitions, i.e. /dev/sd0c (/dev/wd0d
> for those i386 guys) to be used as the components and thus even the boot
> blocks would all be mirrored transparently, even if they're re-installed
> via /dev/raid0? 

Yes. 

Hmm... what do the disklabels look like in this case?  Do you have a 
"sd0h" that is exactly the same as "sd0d", but is of type DT_RAID?
(and, then, when you do a 'disklabel raid0', you end up with the same 
label that is on sd0!!??)
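(I'm guessing at something like this -- the sizes are invented:

   #        size    offset    fstype  [fsize bsize cpg]
    c:   8388608         0    unused      0     0        # whole disk
    h:   8388608         0      RAID                     # same extent, FS_RAID

i.e. an extra partition spanning the same sectors as the raw whole-disk 
one, just so its fstype can say RAID.)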

> (which would of course be the natural and most obvious
> way to re-install the boot block on a RAID-1 root disk)

> The more I think about this the better I like it -- it's just so
> completely transparent to all the operations one does with disks that it
> eliminates all the confusion and hassle we've seen to date (e.g. with
> several people independently finding it necessary to write and publish
> HOWTO and FAQ documents).

What about the disklabel confusion?  And this really only works with 
RAID-1.  For RAID 0 and others, you still need to leave room for a 
disklabel that is *outside* of the area used for component data.

> We wouldn't even need the hack of allowing boot blocks to pretend RAID
> slices are equivalent (except for the hidden FS offset) to 4.2BSD slices
> if they can find a second-stage boot program and load it.
>
> I think we could even automate the handling of the reconstruct should
> one component be booted and used alone -- an /etc/rc.d script could
> check whether the root filesystem 

That works for a root filesystem, but what about other random 
filesystems?

> is a RAID-1 or not and if not then it
> could look at the disk label of the root disk and if it finds a
> whole-disk RAID component then it could look for a similar label on
> another device and try to mark that other component as failed.  This
> process could be configured with explicit knowledge of what the device
> pairs are for the RAID-1 volume to make it safer,

But not moveable to another system?  And prone to fail if devices are 
removed/added to the system?

> and it could be made
> more reliable with the addition of a tri-state flag instead of a "good"
> or "failed" state for each component where one component could claim
> superiority even if the other were still "good" and thus even if the
> other cannot be explicitly marked as "failed" the existing one can be
> marked as having been used more recently.  If both end up in the
> "superior" state then they're reset to "good" and assumed to be
> consistent (which would only hose the system if somehow you booted each
> component disk independently into multi-user mode in succession but each
> time the other was not accessible and could not be marked as "failed").

I'm not sure that automating any of this is going to be easy.  If it 
were, I'd have implemented something by now :-}
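(The easy first step is at least detectable from an rc.d script -- a 
very rough sketch that leaves out all the hard parts described above:

   #!/bin/sh
   # is the root filesystem already on a RAID device?
   rootdev=$(sysctl -n kern.root_device)
   case "$rootdev" in
   raid*) exit 0 ;;                     # nothing to do
   *)     echo "root is on $rootdev"    # here is where the hard,
          ;;                            # error-prone part would start
   esac

Deciding which other component to mark as failed, safely, is the part 
I wouldn't trust a script with.)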

> > If one has to put a component label at the beginning or end of a 
> > partition, it's much harder to do the above.  (esp. with live disks 
> > that contain existing data...)
> 
> At the end of a partition is always the better place for "additional"
> information for anything like a software RAID-1 label.  Putting this
> kind of information at the beginning of the partition jiggles everything
> else that thinks it can be at the beginning of the partition.

Consistency in the location of component labels is important.  
Component labels at the end of partitions don't help for non-RAID-1 
sets.  If we were just dealing with RAID-1, component labels at the 
end would be much easier to handle.  

Some additional things to ponder:  Some disks get configured with 
both RAID-1 and RAID-5 components present on the same disks.
(e.g. sd0a and sd1a are for a mirrored /, sd2b and sd3b are for 
mirrored swap, sd4b is for dumps.  sd0e, sd1e, sd2e, sd3e, and sd4e 
are for a RAID-5 set).  Does this become a "special case"?  Will the 
"default install" be able to cover this?   One can't simply put a 
component label "at the end of the disk, covering the whole disk". 
A separate meta-data partition can cover this, as can the current 
setup.
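(The RAID-5 piece of that example would be its own config, in the same 
format as the mirrored-set sketch above -- again, the names are only 
illustrative:

   START array
   1 5 0

   START disks
   /dev/sd0e
   /dev/sd1e
   /dev/sd2e
   /dev/sd3e
   /dev/sd4e

   START layout
   # sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level
   32 1 1 5

   START queue
   fifo 100

so whatever scheme we pick for component labels has to work for that 
set too, not just the mirrors.)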

If a RAID-1 set is "grown", how easy is it to get the component label 
to move "at the same time" as the disklabel changes?  Unless the 
label is written *before* the disklabel is done, there will be a 
window of time in which a crash (or whatever) would leave the system
unable to find its component labels!

So no, I'm not at all convinced that component labels at the end of the 
partition will solve all the problems without introducing more... :)
(I do agree they would make booting from RAID-1 sets easier, but...)

Later...

Greg Oster