Subject: Re: Two RAIDFrame Problems -- Reconfigure + Hot Spare
To: Rob Ginn <rob@sun701.nawcad.navy.mil>
From: Greg Oster <oster@cs.usask.ca>
List: netbsd-users
Date: 05/22/2002 09:08:19
Rob Ginn writes:
> Hi,
> I'm having two problems with the RAIDFrame system
> under NetBSD.  The first deals with changing the
> configuration of a RAID device, the second with the
> operation of the hot spare.  I've spent a week on it
> and any help would be much appreciated.  I'm running
> NetBSD 1.5.2 on an i386 platform.
> 
> 
> Problem #1
> ----------
> I first configured 2 RAID devices.  One had
> 3 components and no spares and one had 3 components
> and 1 hot spare.  I set them to autoconfigure (1 root,
> the other not).  I installed to the root one (which
> had no spare) and all was well.  Then I decided to
> reconfigure the root one to include a hot spare
> and now I'm almost bald :)   The following details
> the approaches I've taken to "kill" the configuration,
> some pretty severe.  Nothing works, even completely
> zeroing the raid components.  (BTW, I've written these
> steps after the fact, but I'm 99.9% sure they are what
> I did and in the same order):
> 
>    0. -- my initial setup --
>       vi raid0.conf      # create the configuration file
>                          #   NB: NO hot spare, the other
>                          #       components are sd0f, sd1f,
>                          #       and sd2f
>       vi raid1.conf      # create the configuration file
>                          #   NB: I've GOT a hot spare in
>                          #       this one (sd0e, sd1e, sd2e
>                          #       are active and sd3e is hot spare)
>       raidctl -C raid0.conf raid0  # configure raid0
>       raidctl -C raid1.conf raid1  # configure raid1
>       raidctl -s raid0   # check config .. looks good
>       raidctl -I 0 raid0 # initialize components in raid0
>       raidctl -I 0 raid1 # initialize components in raid1

Umm.... you should *REALLY* be using different serial numbers for these 
different RAID sets.
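
For instance, each set could get its own serial number when its component 
labels are initialized.  This is only a sketch: the base value 2002052200 is 
an arbitrary placeholder (a date stamp), not anything RAIDframe requires, and 
serial_commands is a made-up helper that merely prints the commands so they 
can be checked before being run as root:

```shell
#!/bin/sh
# Print one `raidctl -I` command per RAID set, each with a distinct serial.
# The base value 2002052200 is an arbitrary placeholder (a date stamp).
serial_commands() {
    serial=2002052200
    for set_name in "$@"; do
        serial=$((serial + 1))
        echo "raidctl -I $serial $set_name"
    done
}

serial_commands raid0 raid1
```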

>       raidctl -iv raid0  # Initialize parity on raid0
>       raidctl -iv raid1  # Initialize parity on raid1
>       # NB: here I partitioned and formatted w/in the
>       #     raid device raid0 (but not raid1)
>       raidctl -A root raid0 # make raid0 device autoconfigure
>       raidctl -A yes raid1  # make raid1 device autoconfigure
>       reboot
>       raidctl -s raid0   # It comes back correctly
>       raidctl -s raid1   # It comes back correctly
> 
>       == OK, now I want a hot spare on raid0 too ==
>
>    1. summary: I tried just to configure and reconfigure
>                on raid0 only
> 
>       raidctl -u raid0   # unconfigure the raid device
>       vi raid0.conf      # added the hot spare sd3f
>       raidctl -C raid0.conf raid0
>       raidctl -s raid0   # at this point the system shows
>                          #   3 components and 1 hot spare
>                          #   I think I'm done, but ...
>       raidctl -I 0 raid0 # for completeness
>       raidctl -iv raid0  # for completeness
>       raidctl -A yes raid0  # make raid device autoconfigure
>       raidctl -s raid0   # still looks good
>       reboot
>       raidctl -s raid0   # I've lost the hot spare!

One thing to note here (and it should be noted in the raidctl man page if it 
isn't already) is that hot-spares don't autoconfigure.  The easiest way to 
make sure you have a hot-spare around for an autoconfig set is to add
a line like:
  
  raidctl -a /dev/sd3f raid0

to a startup script (e.g. /etc/rc.local), so that sd3f is added as a 
hot-spare for raid0 on every boot.

(I thought of having hot-spares autoconfig, but that requires writing a 
component label to them, and basically tying them to the RAID set...)

And since RAIDframe doesn't do autoconfig of hot spares, all of your 
other attempts ended up being in vain as well :(

[snip]
> So, how can I reconfigure the thing?  Just about the only
> thing I have left to try is to completely zero all the
> drives in the system.  Where is the autoconfiguration info
> being stored?

In the component label, at block 32 (skipping blocks 0-31) of each RAID 
component.
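
To look at that block yourself, something along these lines should work.  It 
is a sketch only: it assumes 512-byte blocks, read_block is a made-up helper 
name, and /dev/rsd0f is assumed to be the raw device for the sd0f component.  
It only reads, it never writes:

```shell
#!/bin/sh
# Write one 512-byte block of a device (or image file) to stdout.
#   $1 = device or file to read, $2 = block number to read
# Assumption: blocks are 512 bytes, so block 32 starts at byte offset 16384.
read_block() {
    dd if="$1" bs=512 skip="$2" count=1 2>/dev/null
}
```

Usage would be something like `read_block /dev/rsd0f 32 | hexdump -C`, run as 
root.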

> Problem #2
> ----------
> I tried powering off a drive in an active configuration
> (on raid1 if you've read the previous problem) which
> had 3 active components and 1 hot spare.  The RAIDframe
> driver detected the problem, marked the component bad,
> but did not start reconstruction to the hot spare.  I can
> manually fail the component (which is already marked as
> failed in the status) and then it starts reconstruction.

Right.  There is no automatic reconstruction in RAIDframe at this point.

> Although I can't find any statement of this capability
> in the docs, this was also true of hardware RAID systems
> I have used in the past (all of which automatically
> switched to the hot spare).  There is a statement
> in the raidctl man page under the "-F" option which reads
> "This is one of the mechanisms used to start the
> reconstruction process ...".  Since the difference between
> the "-f" and "-F" option is the use of the hot spare
> the capability obviously exists in the code (although
> perhaps it is disabled for some reason).  At any rate,
> since the whole point of the hot spare is to allow the
> RAID array to lose multiple disks before a human detects
> the problem and replaces the bad drive(s) I can't imagine
> that the system doesn't do it.

It doesn't.  (Someday it might, and it'd be ~easy to do, but right now it 
doesn't.)  You can emulate 'hot failover' with a cron job or some other 
daemon that watches for a disk going bad and then runs 'raidctl -F'.
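
A rough sketch of such a watcher follows.  It assumes that 'raidctl -s' lists 
each component on its own line as "<device>: <state>" and that a dead 
component shows the state "failed".  Check that against the output on your 
system first.  The function names and the script name are made up for 
illustration:

```shell
#!/bin/sh
# Cron-driven "hot failover" check for one RAID set (sketch, not production).
# Assumption: `raidctl -s` prints component lines like "/dev/sd1e: failed".

# Read a `raidctl -s` listing on stdin and print the devices marked "failed".
failed_components() {
    awk '$2 == "failed" { sub(/:$/, "", $1); print $1 }'
}

# Start reconstruction to a hot spare for every failed component of a set.
check_raid() {
    set_name=$1
    raidctl -s "$set_name" | failed_components | while read -r comp; do
        logger "RAIDframe: $comp failed on $set_name, rebuilding to spare"
        raidctl -F "$comp" "$set_name"
    done
}

# Do nothing unless a RAID set name is given, e.g. `sh raidwatch.sh raid1`.
if [ $# -gt 0 ]; then
    check_raid "$1"
fi
```

Saved as, say, /root/raidwatch.sh, it could be run from root's crontab with 
an entry like `*/5 * * * * /bin/sh /root/raidwatch.sh raid1`.  A real version 
should also make sure it doesn't try to fail a component whose reconstruction 
is already in progress.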
 
> So, what am I missing?  Do I need to somehow enable
> the system to use the hot spare?  Is there another
> option (similar to -A) which sets an option in
> the component labels?
> 
> Thanks for any help,

See above.  Sorry for the frustration.

Later...

Greg Oster