NetBSD-Users archive


Problem with raidframe under NetBSD-3 and NetBSD-4



        Hello.  I've been using raidframe with NetBSD for about 8 years with
great success.  Just today, I've run into a situation I've never seen
before, and I am not sure how to resolve it.

        I am running a NetBSD-3.1 system with a RAID 1 mirrored root.  This
system is a Sunfire X2200 with 2 SATA disks mirrored together.
        All was working well, but I decided to image the system onto two new,
smaller drives, since I'm using 1TB drives for a system that requires less
than 3GB of disk space.  In any case, the procedure I thought I'd use was
the following (rough commands are sketched after the list):

1.  Use raidctl -f /dev/wd1a raid0 to fail the second drive and allow me to
pull it out of the system.

2.  Insert the new disk, then label it and configure it as a second RAID
device with a missing component.

3.  Dump/restore the running image onto the new drive.

4.  Load the new drive into the new machine, and add a second drive to
that.

5.  Re-insert the original drive from step 1 and rebuild the RAID onto it.
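        (For the record, steps 2 and 3 looked roughly like the following.
The "raid1" name, the /tmp/raid1.conf path, the serial number and the /mnt
mount point are approximations for illustration, not a verbatim transcript;
the layout numbers match the component label shown further down.)

# contents of /tmp/raid1.conf -- new set with one absent component
START array
# numRow numCol numSpare
1 2 0
START disks
absent
/dev/wd1a
START layout
# sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level
64 1 1 1
START queue
fifo 100

# step 2: label the new disk (a: partition of type RAID), then force-
# configure and initialize the degraded set
# (raid1, the config path and the serial number are illustrative)
disklabel -e wd1
raidctl -C /tmp/raid1.conf raid1
raidctl -I 20081022 raid1
raidctl -i raid1
raidctl -A root raid1

# step 3: newfs the new set and dump/restore the running root onto it
disklabel -e raid1
newfs /dev/rraid1a
mount /dev/raid1a /mnt
dump -0f - / | (cd /mnt && restore -rf -)

        (The "absent" keyword is what lets the new set configure with only
one component present, so the running system can be copied onto it.)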

        I got as far as step 5, but then ran into trouble.

        Due to some fiddling around on my part, I had to reboot the system a
few times during these steps, so I ended up with a RAID set that looks like
this (raidctl -s output):

Components:
          component0: failed
           /dev/wd0a: optimal
No spares.
component0 status is: failed.  Skipping label.
Component label for /dev/wd0a:
   Row: 0, Column: 1, Num Rows: 1, Num Columns: 2
   Version: 2, Serial Number: 20071030, Mod Counter: 135
   Clean: No, Status: 0
   sectPerSU: 64, SUsPerPU: 1, SUsPerRU: 1
   Queue size: 100, blocksize: 512, numBlocks: 1953524992
   RAID Level: 1
   Autoconfig: Yes
   Root partition: Yes
   Last configured as: raid0
Parity status: DIRTY
Reconstruction is 100% complete.
Parity Re-write is 100% complete.
Copyback is 100% complete.

        "No problem", I thought.   I'll do:

raidctl -a /dev/wd1a raid0
<Works fine, the second disk shows up as a hot spare>

raidctl -F component0 raid0
<Here's where the trouble starts>

        What happens after this command is run is just strange.
First, all seems OK: I get the messages from the kernel about starting
reconstruction.  Then, a check with
raidctl -S
shows that parity, copyback and reconstruction are all 100% complete.
(This is mere seconds after running the reconstruction command above.)

        At this point, I notice that the system is starting to lock up, and
that anything needing disk I/O hangs.  The raidctl -s command hangs, with
ps showing it waiting on "raidframe".  The only way to recover is to power
cycle the machine.
        I've tried a NetBSD-4 kernel with this setup, and it demonstrates the
same behavior.  I've verified that both SATA ports are working properly and
at full speed.  I've also verified that both drives are in working order.
        What this seems like is that the still-working drive of the RAID got
corrupted in some way that fsck doesn't detect, but that raidframe
absolutely doesn't like.  I'm willing to believe I made an error in my
procedure that is shooting me in the foot, but I'll be darned if I can
figure out how to recover from this situation.
        Has anyone else seen this problem, and if so, does anyone have
suggestions about what I might do to get my drives mirrored again?
        If it helps, I've got several other X2200 machines running with
similar configurations, and they're happy to build drives all day long
without a hitch.  I don't get any errors from the disk drivers saying they
can't read or write a drive, and reading and writing both drives is working
fine in other contexts.  So, I think there's something wrong with the good
mirror, but I'm not sure what.
Any thoughts?
-thanks
-Brian


