Subject: Re: RAIDFrame and NetBSD/sparc booting issues
To: Greg Oster <oster@cs.usask.ca>
From: Greg A. Woods <woods@weird.com>
List: port-sparc
Date: 08/13/2003 01:13:23
[ On Tuesday, August 12, 2003 at 16:17:48 (-0600), Greg Oster wrote: ]
> Subject: Re: RAIDFrame and NetBSD/sparc booting issues 
>
> That would be correct.  A slightly different set of deck chairs to 
> shuffle :)

OK, well that's what I thought.  So the question becomes:  what good
would it really do to aggregate the component labels into a new type of
partition that would have to be allocated on every physical disk
containing any RAID components?

It sounds to me as though the onus would be on the user to define a
metadata partition before configuring any RAID sets (as well as defining
all the component partitions, of course), and to make sure that
"raidctl" is somehow told to put the right component labels in the right
metadata partitions when the RAID volumes are initially created.  I.e. a
single metadata partition on each disk containing any RAID components
would make designing and implementing a system configuration with
RAIDframe a whole lot more complicated than it really needs to be (and
than it already is :-).

Keeping things simple and straightforward, with one component label per
component stored within the component partition, seems like by far the
best thing to do.

> When it's doing autoconfig, it doesn't really care what the component 
> name is.  It saves that information so that it can figure out what 
> component the user is talking about, but that's about it.

I would think that knowing the user-land name for a component (or at
least what it was when the component was initially configured) would be
paramount to decent error reporting.

> And we can do that now! :)

Yes, exactly my point!  The procedure we could use for converting any
system to use RAID-1 is already well known by every sysadmin familiar
with repartitioning a disk.  Well, except for the part about getting an
arbitrary system to boot a RAID-1 volume containing the root filesystem
of course.  That's where putting the component label at the end of the
component partition comes in.  That one simple change instantly makes it
possible to convert any existing NetBSD system, sparc or otherwise, into
a RAID-1 root system:  simply add another physical disk of the same (or
larger) size as the existing boot disk and then, if necessary, fiddle in
some very minor way with only the "last" existing partition on the
existing boot disk (and then configure the RAID set and install and boot
a RAIDframe autoconfiguration capable kernel, of course :-).  No need
for new boot blocks, no need for shuffling the position of the root
filesystem, no need for tricky calculations to find the location of the
dump partition, no confusion over where to find the root filesystem
without the help of RAIDframe, etc.

(except perhaps those systems which already have boot sectors which know
about the hack of adding 64 to the offset of a RAID partition to find
the FFS filesystem within....  but maybe there's a trick we could play
on them too)
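
Just to make the proposal concrete, here's roughly how I imagine the
conversion going on a live system, assuming the label-at-end support
existed.  The device names, the choice of 'e' for the whole-disk RAID
partitions, and the serial number are only examples; /etc/raid0.conf
would contain the usual sort of thing:

  START array
  # numRow numCol numSpare
  1 2 0

  START disks
  /dev/sd0e
  /dev/sd1e

  START layout
  # sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level
  128 1 1 1

  START queue
  fifo 100

and then the whole conversion is something like:

  # shave 64 sectors off the "last" partition on the existing boot disk
  # (only if it runs right to the end of the disk), add a whole-disk
  # partition of type RAID, and copy the label to the new mirror disk:
  disklabel -e sd0
  disklabel sd0 > /tmp/proto && disklabel -R sd1 /tmp/proto

  # force the initial configuration, stamp the component labels, and
  # copy the live disk onto the blank mirror by "reconstructing" it
  # (rather than trusting a parity rewrite to go in the right direction):
  raidctl -C /etc/raid0.conf raid0
  raidctl -I 2003081301 raid0
  raidctl -R /dev/sd1e raid0

  # finally mark the set as the auto-configured root and boot a kernel
  # with RAID_AUTOCONFIG:
  raidctl -A root raid0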

> I'm sure I've given some reasons for "one filesystem per RAID set" 
> before, but here are a few again:
>  1) Sets with / and swap get into "full redundancy" much sooner.

Well, yes, OK, if you don't have to check and possibly re-construct
parity for the swap volume after a reboot then you save that many I/Os.
However, how does one actually do that?  I understood that parity always
had to be checked for RAID-1 volumes, even if you first wrote zeros over
both components.

I'm not sure this is a huge win though, especially not for my systems.
I don't expect them to crash very often (and so far they don't :-).
Indeed it is the fact that they run reliably for so long that makes me
interested in using RAID-1 on them, to make them more resilient to what
I expect will be their most likely first critical failure (i.e. a hard
disk media error).

>  2) A failure in one part of one disk doesn't cost redundancy or 
> performance in the other RAID sets. 

As I wrote elsewhere recently, I don't buy the redundancy argument at
all, at least not for random media errors.  If you haven't replaced the
first failed drive before the second starts to fail then all bets are
off regardless of where the failures occur on the second disk.  Your
chances of squeaking through are really no better, and certainly there's
no worthwhile benefit here from a pure risk assessment P.O.V.

As far as performance goes for RAID-1 sets, isn't it actually faster
with a failed component, at least for writes?

>  3) RAIDframe throttles the number of IO's that can go to each 
> individual RAID set, so more RAID sets means more IO.

hmmm...  well that would be a good reason, except for the fact that it's
an implementation policy which, IIUC, could be changed, or at least
tuned more appropriately.  :-)
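
For reference, the per-set knob I have in mind is the queue section of
the raidctl(8) configuration file, where (IIUC) the second field sets
how many requests may be outstanding on that set at once:

  START queue
  fifo 100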

>  (and likely 
> more head thrashing, but that's a slightly different issue :) )

I've thought about this quite a bit since building a system with one
RAID-1 volume per filesystem and then observing it work when it had to
page rather heavily.

From that experience, and even without hard evidence from disk access
traces, I can only conclude that multiple RAID-1 volumes per physical
disk make head contention worse.  The application on this particular
machine was a caching named.  The disk contention was caused by having
/var/log and swap on the same spindle (with lots of syslog activity from
named).  At times the DNS query response time rose into multiple
seconds, even for locally authoritative records.  The previous
incarnation of the system, with a slower CPU and a slower disk but
similar paging activity, never exceeded a couple of hundred milliseconds
of response delay.  I don't see how having to do two writes for every
page-out, to two different spindles on Ultra/WIDE SCSI drives, could
cause response times to increase by over an order of magnitude unless
head-positioning contention were made catastrophically worse in some
way.  I suppose I could break the /var (and swap?) mirror(s) and force
the machine to page again to test this theory, without having to
re-install the machine from scratch.
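
If I do get around to that experiment it should only take something like
the following, with the raid device and component names obviously being
whatever that machine actually uses:

  raidctl -f /dev/sd1e raid1    # fail the second component of the mirror
  raidctl -s raid1              # confirm the set is now running degraded
  # ... force the machine to page heavily, watch the response times ...
  raidctl -R /dev/sd1e raid1    # put the component back and rebuild it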

>  4) In the event of "something going wonky" on a server, it's often 
> possible to unmount data filesystems and unconfigure their associated 
> RAID sets before attempting a reboot.  If the reboot fails (see 
> "something going wonky") then you don't have to rebuild the parity on 
> the RAID sets that were unconfigured.

Is that really going to save much time if you have to carefully but
manually undo part of the configuration?  How many users are skilled
enough to do that kind of thing under pressure?  And how often might it
happen in the real world, where one is using RAID-1 volumes to help
increase the relative availability of critical servers that don't
usually suffer the kind of ongoing changes that might cause something to
go wonky in the first place?


I'm not entirely opposed to using one RAID volume per filesystem -- I'm
just not entirely convinced that it's the best or only way to go for
building systems where the use of RAID (level 1 in particular) has the
primary goal of increasing apparent system availability (i.e. permitting
scheduling of downtime for what we all no doubt perceive as one of the
most common hardware failure risks).

Indeed if the primary goal of using RAID-1 for the root filesystem is
only to improve data integrity on that filesystem, and if one is willing
to take a panic() on any swap I/O error, then using RAID-1 just for the
root filesystem (and, say, RAID-5 for user data), while "striping" swap
across all the disks with the basic multiple-swap-area configuration,
could make sense (it would be much faster in heavy paging scenarios).
Even in that case, though, I think putting the component label at the
end of the RAID-1 components, in order to keep the disk labels, boot
sectors and root filesystem fully transparent, would be well worth the
effort.
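
For example (assuming I have the fstab syntax right), giving the native
swap partitions on each disk the same priority should get page-outs
interleaved across all the spindles without RAIDframe being involved at
all:

  /dev/sd0b  none  swap  sw,priority=0  0 0
  /dev/sd1b  none  swap  sw,priority=0  0 0
  /dev/sd2b  none  swap  sw,priority=0  0 0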

> Well... "scalable" and "easy to expand once you figure out what it is 
> you really want to do" are also important..  I'd hate to go with the 
> super-simple solution if it means that 99% of people who setup a RAID-1 
> set end up having to re-do an install one month later when they figure 
> out what they really want.  We already have a "simple" implementation 
> of a component label.  It turns out that said implementation isn't 
> compatible with the nuances of various archs. 

I think that's one of the other really elegant things about what I'm
proposing:  it really is easily scalable and adaptable.  Changing a
system configuration down the road is no more difficult than it would be
if RAIDframe were not in use at all (just a little more time consuming).
Everything stays exactly the same as it would without RAID-1 until the
last minute, when you plop in the second drive and turn on RAID-1.  If
you want to repartition your system disk then you just turn off
RAIDframe, do what you want on your primary drive (keeping only the
extra whole-disk "RAID" partition and reserving the last 64 sectors of
the disk), and then when you're happy again you simply turn RAIDframe
back on and re-construct the second drive.
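
In sketch form, and again assuming the label-at-end support and example
device names, the repartition-later case is nothing more than:

  # boot a kernel without RAID_AUTOCONFIG from /dev/sd0a (or come up
  # single-user and "raidctl -u raid0"), then rearrange sd0 as desired,
  # keeping the whole-disk RAID partition and the last 64 sectors intact:
  disklabel -e sd0

  # copy the new label to the mirror, reconfigure, and rebuild it:
  disklabel sd0 > /tmp/proto && disklabel -R sd1 /tmp/proto
  raidctl -C /etc/raid0.conf raid0
  raidctl -R /dev/sd1e raid0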

I think I would find re-configuring a system with one RAID-1 volume per
filesystem a lot more complex and difficult to do.

> Hmm... what do the disklabels look like in this case?  Do you have a 
> "sd0h" that is exactly the same as "sd0d", but is of type DT_RAID?
> (and, then, when you do a 'disklabel raid0', you end up with the same 
> label that is on sd0!!??)

I was under the impression that the disklabel from the /dev/raidN device
always came from an in-core version constructed from information in the
component label -- is this not so?

In any case the partition table for the RAID-1 volume has to be
identical to that of the component disks for every partition containing
OS data (filesystems, swap, etc.), at least w.r.t. offset and size.
That's the whole idea -- the partition table in the RAID-1 volume must
look exactly like the partition tables of the two drives holding the
components, at least w.r.t. the offsets and sizes of the partitions
containing filesystems and swap, since this is what makes it possible
for RAID-1 to be used completely transparently on the root disk of any
existing system.

The only difference is whether you mount /dev/raid0a or /dev/sd0a.  If
you mount the latter then you bypass RAIDframe and muck with one of the
disks directly, without RAIDframe knowing (which is what you might want
to be able to do if the mirror disk is dead and you don't want RAIDframe
complaining while you run on just one disk).  However if you mount
/dev/raid0a you access exactly the same sectors, but on both disks
"simultaneously".

The only thing I haven't figured out in my mind is whether /dev/sd0c
should include the last 64 sectors or not.  I think it should, but in
that case the /dev/raid0c size has to be adjusted down since it
logically cannot (be allowed to) access those last 64 sectors.

But yes, the labels on the component disks would look exactly as they
would if RAID-1 were not in use, with the exception that there would be
an additional partition of type "RAID" covering the whole range of the
disk (just as the standard 'c' partition does and would continue to do).
A RAIDframe kernel with "RAID_AUTOCONFIG" would spot this RAID
partition on the two mirrored root disks and would offer up /dev/raid0a
as the root partition because this RAID set would have been told to do
so with "raidctl -A root raid0".  However a "normal" kernel would simply
see that it was booted from /dev/sd0a (or /dev/sd1a) and would use that
as its root partition.  In both cases it would be the same resulting
root filesystem, at least until one of the components was written to
outside the purview of RAIDframe.  :-)
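
To make that concrete, a component disk's label under this scheme might
look something like the following (the sizes and the choice of 'e' are
purely illustrative).  The label reported for /dev/raid0 would be
identical except that it would have no 'e' entry and its total ('c')
size would be 64 sectors smaller:

  #        size    offset    fstype
  a:     409600         0    4.2BSD    # /     -- same in the raid0 label
  b:     524288    409600      swap    #       -- same in the raid0 label
  c:   17682084         0    unused    # whole disk, as usual
  d:   16748132    933888    4.2BSD    # /usr  -- last partition, 64 short
  e:   17682084         0      RAID    # whole-disk RAID-1 component; its
                                       # label sits in the last 64 sectors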

> And this really only works with 
> RAID-1.

Yes, of course, but (at least for now) that is really the only important
place it needs to work.

>  For RAID 0 and others, you still need to leave room for a 
> disklabel that is *outside* of the area used for component data.

Yes of course, but I don't think it is necessary to try to support
booting from a RAID-0 volume, at least not in a transparent way like
this (maybe with explicit boot sector support).

Booting from a RAID-5 volume is more interesting, at least to me, but
again I think it will require serious boot sector support, and I don't
expect it to be simple unless you go the Linux MILO way, where the boot
loader has at least half the functionality of a full kernel, if not
more.

> Consistency in the location of component labels is important.  

Do you mean consistency between volumes with different RAID levels
(parity configurations)?  Why is that important?  Didn't you say it
would be possible to keep this backward compatible in any case by
checking for a component label at both the beginning and the end of the
RAID partitions?

> Component labels at the end of partitions doesn't help for non-RAID-1 
> sets.

No, of course not.

>  If we were just dealing with RAID-1, component labels at the 
> end would be much easier to handle.  

Yes, that's exactly what I mean to propose -- just adding support for
placing and detecting the component labels of RAID-1 volumes at the end
of the components, so that the offsets and sizes of the mirrored
filesystems can remain identical to what one sees natively on either
disk.
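
In other words nothing about the data layout moves at all; only the
label does.  For a component partition of SIZE sectors the label would
simply occupy the last 64 sectors instead of the first 64, e.g.
(illustrative only, since of course nothing reads or writes labels there
today):

  SIZE=17682084               # size of the RAID partition, per disklabel
  dd if=/dev/rsd0e bs=512 skip=$(( SIZE - 64 )) count=1 | hexdump -C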

> Some additional things to ponder:  Some disks get configured with 
> both RAID-1 and RAID-5 components present on the same disk.

Yes, in theory the RAID-1 set doesn't have to extend to the end of the
disk -- just over all the partitions one wants to mirror using RAID-1 as
part of the boot disk, which could be just the root filesystem and swap,
or even just the root filesystem.

I wrote this little blue-sky idea in my notes directory while composing
the previous reply:

  RAID-1+ -- "many component" RAID-1

  Implement a queue or transaction log such that a system could do lazy
  updates of multiple additional RAID-1 mirror components so that one
  could build a many-component RAID-1 volume without the performance hit
  of having to write to all components simultaneously.
  
  This would allow striped reads to be spread across all components
  and assigned to components based on the address of the last sector
  accessed on each component (thus reducing seek time).
  
  This way even very large disks could be used efficiently in a system
  with one RAID-1+ volume for the base OS and then the rest of each disk
  can be aggregated together into a RAID-5 or a RAID-1+0 volume for user
  data.
  
  Each disk is identical in layout and bootable (though on busy systems
  all but the first two may be slightly out of date), and the remaining
  part of the disks holding the OS volume is used more efficiently.


> Does this become a "special case"?

It depends on what you mean I guess....  :-)

>  Will the 
> "default install" be able to cover this?

The only change necessary to the install procedure is to always reserve
(i.e. avoid using in any filesystem partition) those 64 sectors at the
end of what will become the RAID-1 component(s).  This could be done
manually now by simply subtracting 64 sectors from the size of the
"last" partition, and it can be documented by defining a little "unused"
partition to "occupy" that space.  Assuming sysinst allows you to choose
"RAID" as a partition type, even the RAID-1 component partition could be
defined during the install as well.
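
Using the same illustrative numbers as the label sketch earlier, the
reserved area (and the 64-sector shortfall in the "last" filesystem)
could be documented like this:

  d:   16748132    933888    4.2BSD    # "last" filesystem, 64 sectors short
  f:         64  17682020    unused    # placeholder for the component label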

> If a RAID-1 set is "grown", how easy is it to get the component label 
> to move "at the same time" as the disklabel changes?

Why bother trying to move the component label?  For a RAID-1 volume it's
easier just to start from scratch.  Yes, this does mean re-constructing
the whole mirror again when you're done, but I think that's a small
price to pay for the simplicity and elegance otherwise achieved.  I
suppose if someone is willing to go to the work of supporting this kind
of change automatically then it could be a nice feature to have, but I
wouldn't even put it on the priority list of things to do at some later
date.

(As for implementing this safely, well, disklabel would have to
understand at least something about RAID partition internals:  it would
have to write the updated component labels in their new location before
updating the disk partition tables, and lastly it would zero out the old
component label sectors just for good measure.  That way, if there were
a crash in between, you'd be back to where you started, at least
assuming you're expanding the RAID-1 volume to cover an unused part of
the drive.  If you're shrinking the RAID-1 volume then presumably you've
already moved the data off the partition being shrunk, and you accept
that it may be corrupted in that event.  In theory such an intelligent
version of disklabel would know enough to simultaneously re-partition
both disks in a RAID-1 volume too!  ;-)

> (I do agree they would make booting from RAID-1 sets easier, but...)

I think the transparency of booting RAID-1 sets is only a small part of
the total gain (though it is more significant on those types of systems
which cannot directly boot from RAID-1 root filesystems).

The simplicity, elegance and ease of setup are very significant factors
too, and for anyone running existing production servers the ability to
"trivially" make use of RAID-1 is the primary benefit.  I've got two
SPARC servers here right now that already have mirror disks installed
and ready to use, but in working out the details of how to implement
RAID-1 on them I had pretty much come to the conclusion that I would
have to re-partition and re-install to make it work.  The mere thought
of having to do that on what I consider to be production servers was
putting me right off the whole idea (even though such a task would be
quite easy for both of these machines, as they currently each have only
about a half-dozen changed files in /etc, plus my own bare-bones home
directory containing only copies of my default .profile, etc.).

Indeed I can soon set up at least one, and maybe two, of my less
critical sparc systems with mirror drives to use for testing such a
change to RAIDframe ....  :-)

-- 
						Greg A. Woods

+1 416 218-0098                  VE3TCP            RoboHack <woods@robohack.ca>
Planix, Inc. <woods@planix.com>          Secrets of the Weird <woods@weird.com>