Subject: Two RAIDFrame Problems -- Reconfigure + Hot Spare
To: None <netbsd-users@netbsd.org>
From: Rob Ginn <rob@sun701.nawcad.navy.mil>
List: netbsd-users
Date: 05/21/2002 16:17:17
Hi,
I'm having two problems with the RAIDframe system
under NetBSD.  The first deals with changing the
configuration of a RAID device, the second with the
operation of the hot spare.  I've spent a week on it
and any help would be much appreciated.  I'm running
NetBSD 1.5.2 on an i386 platform.


Problem #1
----------
I first configured 2 RAID devices.  One had
3 components and no spares and one had 3 components
and 1 hot spare.  I set them to autoconfigure (1 root,
the other not).  I installed to the root one (which
had no spare) and all was well.  Then I decided to
reconfigure the root one to include a hot spare
and now I'm almost bald :)   The following details
the approaches I've taken to "kill" the configuration,
some pretty severe.  Nothing works, even completely
zeroing the raid components.  (BTW, I've written these
steps after the fact, but I'm 99.9% sure they are what
I did and in the same order):

   0. -- my initial setup --
      vi raid0.conf      # create the configuration file
                         #   NB: NO hot spare, the other
                         #       components are sd0f, sd1f,
                         #       and sd2f
      vi raid1.conf      # create the configuration file
                         #   NB: I've GOT a hot spare in
                         #       this one (sd0e, sd1e, sd2e
                         #       are active and sd3e is hot spare)
      raidctl -C raid0.conf raid0  # configure raid0
      raidctl -C raid1.conf raid1  # configure raid1
      raidctl -s raid0   # check config .. looks good
      raidctl -I 0 raid0 # initialize components in raid0
      raidctl -I 0 raid1 # initialize components in raid1
      raidctl -iv raid0  # Initialize parity on raid0
      raidctl -iv raid1  # Initialize parity on raid1
      # NB: here I partitioned and formatted w/in the
      #     raid device raid0 (but not raid1)
      raidctl -A root raid0 # make raid0 device autoconfigure
      raidctl -A yes raid1  # make raid1 device autoconfigure
      reboot
      raidctl -s raid0   # It comes back correctly
      raidctl -s raid1   # It comes back correctly

      == OK, now I want a hot spare on raid0 too ==

   1. summary: I tried just to configure and reconfigure
               on raid0 only

      raidctl -u raid0   # unconfigure the raid device
      vi raid0.conf      # added the hot spare sd3f
      raidctl -C raid0.conf raid0
      raidctl -s raid0   # at this point the system shows
                         #   3 components and 1 hot spare
                         #   I think I'm done, but ...
      raidctl -I 0 raid0 # for completeness
      raidctl -iv raid0  # for completeness
      raidctl -A yes raid0  # make raid device autoconfigure
      raidctl -s raid0   # still looks good
      reboot
      raidctl -s raid0   # I've lost the hot spare!

   2. summary: As with #1, but first I disabled autoconfig
               and zeroed the start of the raid device

      dd if=/dev/zero of=/dev/rraid0 bs=100b count=10
      raidctl -A no raid0   # disable autoconfig
      raidctl -u raid0   # unconfigure the raid device
      vi raid0.conf      # for completeness (unchanged from
                         #   last time. includes hot spare)
      raidctl -C raid0.conf raid0
      raidctl -s raid0   # at this point the system shows
                         #   3 components and 1 hot spare
                         #   I think I'm done, but ...
      raidctl -I 0 raid0 # for completeness
      raidctl -iv raid0  # for completeness
      raidctl -A yes raid0    # make raid device autoconfigure
      reboot
      raidctl -s raid0   # I've lost the hot spare again!

   3. summary: I disabled autoconfig, zeroed the beginning
               of the raid device, and zeroed the components
               completely.  It still loses the hot spare!

      dd if=/dev/zero of=/dev/rraid0 bs=100b count=10
      raidctl -A no raid0   # disable autoconfig
      raidctl -u raid0   # unconfigure the raid device
      # completely zero the components
      dd if=/dev/zero of=/dev/rsd0f bs=100b
      dd if=/dev/zero of=/dev/rsd1f bs=300b
      dd if=/dev/zero of=/dev/rsd2f bs=300b
      dd if=/dev/zero of=/dev/rsd3f bs=300b # even the spare!
      vi raid0.conf      # for completeness (unchanged from
                         #   last time. includes hot spare)
      raidctl -C raid0.conf raid0
      raidctl -s raid0   # at this point the system shows
                         #   3 components and 1 hot spare
                         #   I think I'm done, but ...
      raidctl -I 0 raid0 # for completeness
      raidctl -iv raid0  # for completeness
      raidctl -A yes raid0    # make raid device autoconfigure
      reboot
      raidctl -s raid0   # I've lost the hot spare!

So, how can I reconfigure the thing?  Just about the only
thing I have left to try is to completely zero all the
drives in the system.  Where is the autoconfiguration info
being stored?
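
For reference, the raid0.conf with the spare added would look
roughly like the sketch below.  The device names are the ones
listed in the steps above; the RAID level (5), the stripe unit
size (32 sectors) and the queue settings are assumptions for
illustration, not copied from the actual file.

```
START array
# numRow numCol numSpare
1 3 1

START disks
/dev/sd0f
/dev/sd1f
/dev/sd2f

START spare
/dev/sd3f

START layout
# sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level
# (assumed values, see note above)
32 1 1 5

START queue
fifo 100
```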


Problem #2
----------
I tried powering off a drive in an active configuration
(on raid1 if you've read the previous problem) which
had 3 active components and 1 hot spare.  The RAIDframe
driver detected the problem, marked the component bad,
but did not start reconstruction to the hot spare.  I can
manually fail the component (which is already marked as
failed in the status) and then it starts reconstruction.
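
The manual workaround can be sketched as the small script below.
It is only an illustration: the helper names (run, fail_to_spare)
and the DRY_RUN switch are made up for this example, and
/dev/sd1e stands in for whichever component was powered off.
The raidctl options used are -F (fail the component and
reconstruct to a hot spare) and -S (report progress).

```shell
#!/bin/sh
# Sketch: manually fail a component so RAIDframe reconstructs onto
# the hot spare, then report reconstruction progress.
# DRY_RUN=1 (the default) only prints the commands it would run.

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

fail_to_spare() {
    component=$1   # hypothetical example: /dev/sd1e
    raid=$2        # e.g. raid1
    run raidctl -F "$component" "$raid"   # fail + reconstruct to a spare
    run raidctl -S "$raid"                # show reconstruction status
}
```

With the functions loaded, "fail_to_spare /dev/sd1e raid1" prints
the two raidctl commands; setting DRY_RUN=0 runs them for real.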

Although I can't find any statement of automatic
reconstruction in the docs, the hardware RAID systems
I have used in the past all switched to the hot spare
automatically.  There is a statement
in the raidctl man page under the "-F" option which reads
"This is one of the mechanisms used to start the
reconstruction process ...".  Since the difference between
the "-f" and "-F" options is the use of the hot spare,
the capability obviously exists in the code (although
perhaps it is disabled for some reason).  At any rate,
since the whole point of the hot spare is to allow the
RAID array to lose multiple disks before a human detects
the problem and replaces the bad drive(s), I can't imagine
that the system doesn't do it.

So, what am I missing?  Do I need to somehow enable
the system to use the hot spare?  Is there another
option (similar to -A) which sets an option in
the component labels?

Thanks for any help,
Rob Ginn