Subject: Re: RAIDFrame and NetBSD/sparc booting issues
To: NetBSD/sparc Discussion List <port-sparc@NetBSD.ORG>
From: Greg Oster <oster@cs.usask.ca>
List: port-sparc
Date: 08/13/2003 16:17:16
[This message is.... Long.  Those of you not interested in 
"RAID-1 booting with RAIDframe, the mini-series" are well-advised 
to hit "next" now :)  GO ]

"Greg A. Woods" writes:
> [ On Tuesday, August 12, 2003 at 16:17:48 (-0600), Greg Oster wrote: ]
> > Subject: Re: RAIDFrame and NetBSD/sparc booting issues 
> >
> > That would be correct.  A slightly different set of deck chairs to 
> > shuffle :)
> 
> OK, well that's what I thought.  So the question becomes:  what good
> would it really do to aggregate the component labels into a new type of
> partition that would have to be allocated on every physical disk
> containing any RAID components?

Only *one* of all the existing partitions would have to change.

> It sounds to me as though the onus would be on the user to make sure to
> define a metadata partition before configuring any RAID sets,

Right.  (e.g. as you would have to do under Tru64 and Solaris(?))

> as well as
> defining all the component partitions of course, and to make sure that
> "raidctl" is somehow told to put the right component labels in the right
> metadata partitions when the RAID volumes are initially created.

"raidctl" doesn't deal with component labels.  Perhaps it could, at 
some point, but right now those are the exclusive domain of the driver.

>  I.e. it
> sounds to me as though using a single metadata partition for each disk
> containing any RAID components would make everything about designing and
> implementing a system configuration using RAIDframe a whole lot more
> complicated than it really needs to be (and than it already is :-).

Such a view may be in the eye of the designer/implementer ;)  

> Keeping things simple and straightforward, with one component label per
> component stored within the component partition, seems like the best
> thing to do by far.

There is still only one component label per component.  The *biggest* 
drawback is that you wouldn't be able to change a partition label 
(e.g. move "e" to "f" or something) without having to update the 
metadata!  This is (yet another) reason why I haven't implemented the 
metadata stuff yet... (You may recall a thread from a while back on 
integrating such "RAID metadata" with "disklabels".  Such integration 
would make this problem a non-issue, as the "disklabel" program could 
be easily taught to deal with it...)
 
> > When it's doing autoconfig, it doesn't really care what the component 
> > name is.  It saves that information so that it can figure out what 
> > component the user is talking about, but that's about it.
> 
> I would think that being able to know the user-land name for a component
> (or at least what it was when the component was initially configured)
> would be paramount to having decent error reporting.

Right.
 
> > And we can do that now! :)
> 
> Yes, exactly my point!  The procedure we could use for converting any
> system to use RAID-1 is already well known by every sysadmin familiar
> with repartitioning a disk.  Well, except for the part about getting an
> arbitrary system to boot a RAID-1 volume containing the root filesystem
> of course.  That's where putting the component label at the end of the
> component partition comes in.  That one simple change instantly makes it
> possible to easily convert any existing NetBSD system, sparc or
> otherwise, into a RAID-1 root system by simply adding another physical
> disk of the same (or larger) size as the existing boot disk and then if
> necessary by fiddling in some very minor way with only the "last"
> existing partition on the existing boot disk (and then configuring the
> RAID set and installing and booting a RAIDframe autoconfiguration
> capable kernel of course :-).  No need for new boot blocks, no need for
> shuffling the position of the root filesystem, no need for tricky
> calculations to find the location of the dump partition, no confusion
> over where to find the root filesystem without the help of RAIDframe,
> etc.
> 
> (except perhaps those systems which already have boot sectors which know
> about the hack of adding 64 to the offset of a RAID partition to find
> the FFS filesystem within....  but maybe there's a trick we could play
> on them too)

Those just look at "partition a" first, right?  In that case, change 
"a" to be "FFS" (or whatever), and create some other partition that 
exactly overlaps "a", but is marked as type "RAID".  The arch will 
boot from "a" as per normal, and RAIDframe will still auto-configure 
based on the "RAID" marking.  (I think this will work.. havn't had 
time to think about this much...)
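
For illustration only (device names and sizes invented), the boot 
disk's label might end up looking something like:

   #        size    offset    fstype   [fsize bsize cpg]
    a:   8388608         0    4.2BSD     1024  8192    16  # root FFS (what the boot code sees)
    b:   1048576   8388608      swap
    c:  17773524         0    unused        0     0        # whole disk
    e:   8388608         0      RAID                       # exactly overlaps "a"

i.e. "a" and "e" describe exactly the same sectors; the boot code uses 
"a", and the autoconfig code keys off the "RAID" fstype on "e".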

> > I'm sure I've given some reasons for "one filesystem per RAID set" 
> > before, but here are a few again:
> >  1) Sets with / and swap get into "full redundancy" much sooner.
> 
> Well, yes, OK, if you don't have to check and possibly re-construct
> parity for the swap volume after a reboot then you save that many I/Os.
> However how does one do this?  I understood that parity always had to be
> checked for RAID-1 volumes even if you first wrote zeros over both
> components.

It will still need to check them... (I haven't done the "raidctl -Z" 
to zero a RAID set...) 
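
(For reference, checking the parity/mirror status on an existing set 
is just:

   # raidctl -p raid0     # report whether the parity is clean
   # raidctl -P raid0     # check it, and rewrite it if it's dirty

and it's that rewrite pass that takes so long on big disks.)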

> I'm not sure this is a huge win though, especially not for my systems.
> I don't expect them to crash very often (and so far they don't :-).
> Indeed it is the fact they run reliably for so long that I'm interested
> in using RAID-1 on them to make them more resilient to what I expect
> will be their most likely first critical failure (i.e. hard disk media
> error).
> 
> >  2) A failure in one part of one disk doesn't cost redundancy or 
> > performance in the other RAID sets. 
> 
> As I wrote elsewhere recently I don't buy the redundancy argument at
> all, at least not for random media errors.  If you haven't replaced the
> first failed drive before the second starts to fail then all bets are
> off regardless of where the failures occur on the second disk.  Your
> chances of squeaking through are really no better at all, and certainly
> there's no worthwhile benefit here from a pure risk assessment P.O.V.
>
> As far as performance goes for RAID-1 sets, isn't it actually faster
> with a failed component, at least for writes?

*marginally* faster, at best.  Multiple reads will typically be much slower 
than they would be in non-degraded mode.
 
> >  3) RAIDframe throttles the number of IO's that can go to each 
> > individual RAID set, so more RAID sets means more IO.
> 
> hmmm...  well that would be a good reason, except for the fact that it's
> an implementation policy which, IIUC, could be changed, or at least
> tuned more appropriately.  :-)
> 
> >  (and likely 
> > more head thrashing, but that's a slightly different issue :) )
> 
> I've thought about this quite a bit since building a system with one
> RAID-1 volume per filesystem and then observing it work when it had to
> page rather heavily.
> 
> From that experience, and even without having hard evidence from disk
> access traces, I can only conclude that multiple RAID-1 volumes per
> physical disk only makes head contention worse. 

Um... "worse than what"??  Worse than separate partitions in a RAID 
Set?  My next question would be: Is that performance better or worse 
than separate partitions on only a single disk? 

> The application on this
> particular machine was a caching named.  The disk contention was caused
> by having /var/log and swap on the same spindle (and with lots of syslog
> activity from named).  At times the DNS query response time rose into
> multiple seconds even for locally authoritative records.  Even on a
> slower CPU and slower disk, but with similar paging activity the
> previous incarnation of the system would never exceed a couple of
> hundred milliseconds response time delay.  I don't see how having to do
> two writes for every page out to two different spindles on Ultra/WIDE
> SCSI drives could cause response times to increase by over an order of
> magnitude if head positioning contention weren't made catastrophically
> worse in some way. 

Short of testing the various combinations, it's very hard to tell.  A 
different system is going to have different dynamics.  If one of 
the "new things" is a RAID-1 set, then there is a pretty good chance 
that the heads will get separated for doing multiple reads, and that 
you'll have to "wait" for them to get together in order to do writes.
Again, depending on the dynamics of the system, you may have to wait 
more or less.

> I suppose I could break the /var (and swap?)
> mirror(s) and force the machine to page again to test this theory
> without having to re-install the machine from scratch again.
> 
> >  4) In the event of "something going wonky" on a server, it's often 
> > possible to unmount data filesystems and unconfigure their associated 
> > RAID sets before attempting a reboot.  If the reboot fails (see 
> > "something going wonky") then you don't have to rebuild the parity on 
> > the RAID sets that were unconfigured.
> 
> Is that really going to save much time if you have to carefully but
> manually undo part of the configuration?  How many users are skilled
> enough to do this kind of thing under pressure?  How often might this
> happen in the real world where one is using RAID-1 volumes to help
> increase the relative availability of critical servers that don't usually
> suffer the kind of ongoing changes that might cause something to go
> wonky in the first place?

As disks get bigger faster than IO rates improve, it typically takes 
longer to do the parity checking.  The longer that check takes, the 
longer the system is vulnerable in the event of a disk failure.
Now yes, hopefully the time to do the rebuild is *much* smaller than 
the typical system uptime, but it does increase the window of 
vulnerability...

> I'm not entirely opposed to using one RAID volume per filesystem -- I'm
> just not entirely convinced that it's the best or only way to go for
> building systems where the use of RAID (level 1 in particular) has the
> primary goal of increasing apparent system availability (i.e. permitting
> scheduling of downtime for what we all no doubt perceive as one of the
> most common hardware failure risks).
> 
> Indeed if the primary goal of using RAID-1 for the root filesystem is
> only to improve data integrity on that filesystem, and if one is willing
> to take a panic() for any swap I/O error, then using RAID-1 just for the
> root filesystem (and, say, RAID-5 for user data), while "striping" swap
> across all the disks with the basic multiple-swap area config could make
> sense (as it would be much faster in heavy paging scenarios).  Even in
> that case though I think putting the component label at the end of the
> RAID-1 components in order to facilitate full transparency for disk
> labels, boot sectors and the root filesystem, would be well worth the
> effort.
> 
> > Well... "scalable" and "easy to expand once you figure out what it is 
> > you really want to do" are also important..  I'd hate to go with the 
> > super-simple solution if it means that 99% of people who setup a RAID-1 
> > set end up having to re-do an install one month later when they figure 
> > out what they really want.  We already have a "simple" implementation 
> > of a component label.  It turns out that said implementation isn't 
> > compatible with the nuances of various archs. 
> 
> I think that's one of the other really elegant things about what I'm
> proposing:  it really is easily scalable and adaptable.  Changing a
> system configuration down the road is not more difficult than it would
> be to do if RAIDframe were not in use at all (just a little more time
> consuming).  Everything stays exactly the same as it would without use
> of RAID-1 until the last minute when you plop in the second drive and
> turn on RAID-1.  If you want to repartition your system disk then you
> just turn off RAIDframe, do what you want on your primary drive (keeping
> only the extra whole-disk "RAID" partition and reserving the last 64
> sectors of the disk), and then when you're happy again you simply turn
> on RAIDframe again and re-construct the second drive.

This works so easily at this high level :)  Let's have a quick boo at 
another detail -- a detail which causes problems for the "component 
labels in a metadata partition" as well:  The size of the data portion 
of a RAID set is a multiple of the stripe width.  Thus, for a given 
partition size, the area available for FFS or LFS data will be different, 
depending on whether the width is 128 or 64 or 32 blocks.  So one 
needs to guarantee that the end of the filesystem falls within that 
(rounded-down) data area, otherwise you'll be missing some bits....
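
A made-up example: take a component partition of 1000002 sectors, with 
64 of them reserved for the component label:

   1000002 - 64 = 999938 sectors available for data
   rounded down to a multiple of 128:  999808  (7811 stripes)
   rounded down to a multiple of  64:  999936  (15624 stripes)
   rounded down to a multiple of  32:  999936  (31248 stripes)

so a filesystem sized for the 64-block case would overhang the end of 
the RAID set by 128 sectors if the set were built with a 128-block 
stripe width instead.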

> I think I would find re-configuring a system with one RAID-1 volume per
> filesystem a lot more complex and difficult to do.

It would be.
 
> > Hmm... what do the disklabels look like in this case?  Do you have a 
> > "sd0h" that is exactly the same as "sd0d", but is of type DT_RAID?
> > (and, then, when you do a 'disklabel raid0', you end up with the same 
> > label that is on sd0!!??)
> 
> I was under the impression that the disklabel from the /dev/raidN device
> always came from an in-core version constructed from information in the
> component label -- is this not so?

The disklabel for a RAID set actually lives "wherever a disklabel 
would live for that arch" on the RAID set.  It's most definitely not 
just 'in-core'.
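
(e.g. you can run

   # disklabel raid0
   # disklabel -e raid0

and the label you see and edit is read from, and written back to, the 
RAID set itself.)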

> In any case the partition table for the RAID-1 volume has to be
> identical to that of the component disks for each partition containing
> OS data (filesystems, swap, etc.), at least w.r.t. offset and size.
> That's the whole idea -- the partition table in the RAID-1 volume must
> look, at least w.r.t. the offsets and sizes of the partitions containing
> filesystems and swap, exactly like the partition tables of the two
> drives holding the component volumes since this is what makes it
> possible for RAID-1 to be used completely transparently on the root disk
> of any existing system.

Having both a physical disk and a RAID set sharing the same disklabel 
makes me quite uneasy right now.  I'm sure I don't understand the 
complete ramifications of doing that, and what sorts of problems it 
may cause in the future.
 
[snip]
> > And this really only works with 
> > RAID-1.
> 
> Yes, of course, but (at least for now) that is really the only important
> place it needs to work.

"at least for now" implies that we need something that scales to the 
other RAID types ;)  (Note that I'm not at all convinced that we need 
to worry about booting from anything other than a RAID 1 set, however.)
 
> >  For RAID 0 and others, you still need to leave room for a 
> > disklabel that is *outside* of the area used for component data.
> 
> Yes of course, but I don't think it is necessary to try to support
> booting from a RAID-0 volume, at least not in a transparent way like
> this (maybe with explicit boot sector support).
> 
> Booting from a RAID-5 volume is more interesting, at least to me, but
> again I think it will require serious boot sector support and I don't
> expect this to be simple unless you go the Linux MILO way where the boot
> loader has at least half the functionality of a full kernel, if not
> more.

The boot loader might as well be the kernel.  
 
> > Consistency in the location of component labels is important.  
> 
> Do you mean consistency between volumes with different RAID levels
> (parity configurations)? 

Yes.

> Why is this important?  

The fewer the number of "special cases", the happier things usually 
are.  Adding a "new place to look" for component labels is a big
move -- it's something that has to be supported for life in a 
backwards-compatible way.  The fewer different places, the easier it 
is to stay backwards-compatible (and the leaner we can keep the 
kernel).

> Didn't you say it
> would be possible to make this backward compatible in any case by
> checking for a component label at both the beginning and end of the RAID
> partitions?

Yes.  You can look at the "end" first, and if there is a label there, 
that's the one you use.  Otherwise you look at the beginning.

> > Component labels at the end of partitions doesn't help for non-RAID-1 
> > sets.
> 
> No, of course not.
> 
> >  If we were just dealing with RAID-1, component labels at the 
> > end would be much easier to handle.  
> 
> Yes, that's what I mean to propose -- just adding support for placing
> and detecting component labels of RAID-1 volumes at the end of the
> components so that offsets and sizes of the mirrored filesystems can
> remain identical with what one sees natively on either disk.

Ignoring some of the other details (e.g. where the boundary of the "data 
part" of a component lies), I think I could see doing this for *only* RAID-1.

A hard part will be making sure that the RAID-1 sets ignore 
rf_protectedSectors, but only if the component label is at the end!  
All other RAID types must continue to obey it!  Right now 
rf_protectedSectors is a constant across all of RAIDframe.  

There would also be configuration issues (e.g. how to specify that 
you want the configuration label at the *end* of the component, 
instead of the beginning), and then, of course, allowing the creation 
of an empty RAID-1 set, and adding in existing components.
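
For reference, a RAID-1 config file today looks something like the 
following, and there's no obvious place in it to say "put the label at 
the end" (the commented-out bit at the bottom is purely hypothetical -- 
nothing parses it):

   START array
   # numRow numCol numSpare
   1 2 0

   START disks
   /dev/sd0e
   /dev/sd1e

   START layout
   # sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level
   128 1 1 1

   START queue
   fifo 100

   # START options
   # label_at_end        <- hypothetical; does not exist today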
 
> > Some additional things to ponder:  Some disks get configured with 
> > both RAID-1 and RAID-5 components present on the same disk.
> 
> Yes, in theory the RAID-1 set doesn't have to extend to the end of the
> disk -- just over all the partitions one wants to mirror using RAID-1 as
> part of the boot disk, which could be just the root filesystem and swap,
> or even just the root filesystem.

What does 'raid0d' (in i386 land) cover in that case?  If it's 
sharing the disklabel with a physical disk, one of the 'd' partitions 
(either (e.g.) sd0d or raid0d) *has* to be wrong!?!?!?!  And the 
disklabel for the RAID set *must* be writeable if you intend to allow 
the user to have multiple partitions per RAID set.   Or is RAIDframe 
supposed to do some "disklabel magic" and do on-the-fly editing of 
the label so that the label that appears from 'disklabel raid0' looks 
different from 'disklabel sd0', yet when you do the 'disklabel -r -R 
raid0 /tmp/label' it "does the right stuff" to create partitions 
suitable for sd0????  Again, I'm not getting a warm, fuzzy feeling 
here..  (well... I'm warm, but that's due to the temperature outside, 
not this discussion :) ).
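
To put some invented numbers on it (and assuming the RAID-1 component 
spans the whole disk): if sd0 is 17773524 sectors, then

   sd0d  (whole physical disk):          17773524 sectors
   raid0 data area, 128-block stripes:   roughly (17773524 - 64) rounded
                                         down to a multiple of 128
                                       = 17773440 sectors

so the two "d" partitions differ by 84 sectors no matter what you do -- 
they can't both be described by one label.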
 
> I wrote this little blue-sky idea in my notes directory while composing
> the previous reply:
> 
>   RAID-1+ -- "many component" RAID-1
> 
>   Implement a queue or transaction log such that a system could do lazy
>   updates of multiple additional RAID-1 mirror components so that one
>   could build a many-component RAID-1 volume without the performance hit
>   of having to write to all components simultaneously.
>   
>   This would allow for striped reads to be spread across all components
>   and assigned to components based on the address of the last sector
>   accessed on each component (thus reducing seek time).

Well.... only striped reads on those sectors that got written to 
*all* of the components.  This sort of "lazy update" can get really 
tricky, especially in the event of a component failure.  (If the 
"master" fails, and you havn't written tot he "mirrors" yet, where is 
the data?)  How does this "lazy writing" mess up assumptions made by 
FFS or FFS+softdeps or LFS or whatever?  In particular, when the 
master dies, how is data "recovered" from one (or more) of the 
mirrors?  (esp. if the master fails to light up after a sudden power 
failure...)

>   This way even very large disks could be used efficiently in a system
>   with one RAID-1+ volume for the base OS and then the rest of each disk
>   can be aggregated together into a RAID-5 or a RAID-1+0 volume for user
>   data.
>   
>   Each disk is identical in layout, and bootable (though on busy systems
>   all but the first two may be slightly out of date) and the remaining
>   part of the disks holding the OS volume is used more efficiently.
> 
> 
> > Does this become a "special case"?
> 
> It depends on what you mean I guess....  :-)
> 
> >  Will the 
> > "default install" be able to cover this?
> 
> The only change necessary to the install procedure is to always reserve
> (i.e. avoid using in any filesystem partition) those 64 sectors at the
> end of what will become the RAID-1 component(s).  

It may need to be more than that, depending on the stripe width.

> This could be done
> manually now by simply subtracting 64 sectors from the size of the
> "last" partition and it can be documented by defining a little "unused"
> partition to "occupy" that space.

I suppose if you said "64+128 in reserve", then you can *always* be 
guaranteed a maximum stripe width of 128 for the underlying data, 
plus have room for the label.  (i.e. if the size of the FFS is n 
blocks, then n+128 blocks will guarantee that you can have at least 
m stripes of 128 blocks that will hold all n blocks of data, where 
m = (n+128)/128 or so.)
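
With made-up numbers: say the FFS you want to mirror is n = 999000 
blocks:

   data area reserved:      n + 128 = 999128 blocks
   usable after rounding:   7805 * 128 = 999040 >= 999000   (fits)
   without the extra 128:   7804 * 128 = 998912 <  999000   (88 blocks short)

plus the 64 blocks for the component label on top of that.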

>  Assuming sysinst allows you to choose
> "RAID" as a partition type even the RAID-1 component partition can be
> defined during install as well.
> 
> > If a RAID-1 set is "grown", how easy is it to get the component label 
> > to move "at the same time" as the disklabel changes?
> 
> Why bother trying to move the component label?  For a RAID-1 volume it's
> easier just to start from scratch.  Yes, this does mean re-constructing
> the whole mirror again when you're done, but I think that's a small
> price to pay for the simplicity and elegance otherwise achieved. 

Whatever happened to our high-availability!!??! ;)  

[snip]
> > (I do agree they would make booting from RAID-1 sets easier, but...)
> 
> I think the transparency of booting RAID-1 sets is only a small part of
> the total gain (though it is more significant on those types of systems
> which cannot directly boot from RAID-1 root filesystems).
> 
> The simplicity and elegance and ease of setup are very significant
> factors too, and for anyone running existing production servers the
> ability to "trivially" make use of RAID-1 is the primary benefit.  

I'm still not sold on "trivially" :)

> I've
> got two SPARC servers here right now that already have mirror disks
> installed and ready to use but in working out the details of how to
> implement RAID-1 on them I had pretty much come to the conclusion that I
> would have to re-partition and re-install in order to make it work

"Yup".  The flip-side, of course, is that by the time this would get 
implemented and into a release, how many systems will still be 
running on "single drives", and/or not be ready for a hardware 
upgrade anyway? (where the new hardware could be configured to use 
RAIDframe as it exists now...)

> and
> the mere thought of having to do that on what I consider to be
> production servers was putting me right off the whole idea (even though
> such a task would be quite easy for both of these machines as they
> currently each have only about a half-dozen changed files in /etc, plus
> my own bare-bones home directory containing only copies of my default
> .profile, etc.).
> 
> Indeed I can soon configure at least one more, and maybe two more, less
> critical of my sparc systems with mirror drives to use for testing such
> a change to RAIDframe ....  :-)

By the time such a change makes it into RAIDframe, the amount of 
testing required by anyone else should be *very* minimal.  (This is 
exactly the sort of change that can make a Huge Mess of things, and 
so it had better be *very* well tested before it makes it into the 
tree...)

Later...

Greg Oster