NetBSD-Users archive


Re: Problem with raidframe under NetBSD-3 and NetBSD-4



Brian Buhrow writes:
>       Hello.  I've been using raidframe with NetBSD for about 8 years with
> great success.  Just today, I've run into a situation I've never seen
> before, and I am not sure how to resolve it.
> 
>       I am running a NetBSD-3.1 system with a RAID 1 mirrored root.  This
> system is a Sunfire X2200 with 2 SATA disks mirrored together.
>       All was working well, but I decided to image the system onto two new
> drives of a smaller size, since I'm using 1TB drives for a system that
> requires less than 3GB of disk space.  In any case, the
> procedure I thought I'd use was the following:
> 
> 1.  Use raidctl -f /dev/wd1a raid0 to fail the second drive and allow me to
> pull it out of the system.
> 
> 2.  Insert the new disk and then label it and configure it as a second raid
> device with a missing component.
> 
> 3.  Dump/restore the running image onto the new drive.
> 
> 4.  Load the new drive into the new machine, and add a second drive to
> that.
> 
> 5.  Re-insert the original drive from step 1 and rebuild the raid to it.
> 
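
(Restating those steps as commands, as I understand them -- the 
disklabel and dump/restore invocations, and the raid1.conf path, are 
my guesses from the description.  Everything is wrapped in an echo so 
the sketch is safe to run anywhere:)

```shell
# Sketch of the migration procedure described above.  The raidctl
# flags are real; the labeling and dump/restore details are assumed.
# Nothing destructive runs: every command is echoed via run().
run() { echo "would run: $*"; }

run raidctl -f /dev/wd1a raid0            # 1. fail the second component
run disklabel -i -I wd1                   # 2. label the new, smaller disk
run raidctl -C /root/raid1.conf raid1     #    ...configure with one absent component
run sh -c 'dump -0f - / | restore -rf -'  # 3. copy the running image over
                                          # 4. move the new drive to the new machine
run raidctl -a /dev/wd1a raid0            # 5. re-add the original drive as a spare
run raidctl -F component0 raid0           #    ...and reconstruct onto it
```
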
>       I got as far as step 5, but then ran into trouble.
> 
>       Due to some canoodling around on my part, it was necessary to reboot
> the system a few times during these steps, so I ended up with a raid set
> that looks like:
> 
> Components:
>           component0: failed
>            /dev/wd0a: optimal
> No spares.
> component0 status is: failed.  Skipping label.
> Component label for /dev/wd0a:
>    Row: 0, Column: 1, Num Rows: 1, Num Columns: 2
>    Version: 2, Serial Number: 20071030, Mod Counter: 135
>    Clean: No, Status: 0
>    sectPerSU: 64, SUsPerPU: 1, SUsPerRU: 1
>    Queue size: 100, blocksize: 512, numBlocks: 1953524992
>    RAID Level: 1
>    Autoconfig: Yes
>    Root partition: Yes
>    Last configured as: raid0
> Parity status: DIRTY
> Reconstruction is 100% complete.
> Parity Re-write is 100% complete.
> Copyback is 100% complete.
> 
>       "No problem", I thought.  I'll do:
> 
> raidctl -a /dev/wd1a raid0
> <Works fine, the second disk shows up as a hot spare>
> 
> raidctl -F component0 raid0
> <Here's where the trouble starts>
> 
>       What happens after this command is run is just strange.
> First, all seems OK, I get the messages from the kernel about starting
> reconstruction.  Then, a check of 
> raidctl -S
> shows that parity, copyback and reconstruction are all 100% complete.
> (This was mere seconds after running the reconstruction command above.)

Could you send me any raid/disk-related contents of /var/log/messages 
from around these times?

>       At this point, I notice that the system is starting to shut down,

You mean "shut down" as in "start acting strangely and beginning to 
stop responding", not as in "starting to reboot...", right?

> and
> that anything needing disk I/O hangs.  The raidctl -s command hangs,
> showing "raidframe" as its wait string in ps.  The only way to recover
> is to power cycle the machine.
>
>       I've tried a NetBSD-4 kernel with this setup, and it demonstrates the
> same behavior.  I've verified that both SATA ports are working properly and
> at full speed.  I've also verified that both drives are in working order.
>       What this seems like is that the still working drive of the raid got
> corrupted in some way that fsck doesn't detect, but that raidframe
> absolutely doesn't like. 

RAIDframe only reads/writes blocks -- it knows nothing about whether 
things are corrupted, other than if it gets an IO error from an 
underlying device...

> I'm willing to believe I made an error in my
> procedure which is shooting me in the foot,

What you describe should work fine...

> but I'll be darned if I can
> figure out how to recover from this situation.
>       Has anyone else seen this problem, and if so, does anyone have
> suggestions about what I might do to get my drives mirrored again?
>       If it helps, I've got several other X2200 machines running with
> similar configurations, and they're happy to build drives all day long
> without a hitch.

"build drives" in what way?  A reconstruct is significantly different 
from a "parity rewrite" in terms of the IO load -- the reconstruct is 
going to be reading one disk and writing to another, while the 
rewrite is going to be primarily just reading from both disks (and 
writing only when it has to).
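
(Concretely, these are the two operations I'm contrasting -- echoed 
rather than executed, with the device and component names assumed 
from your output:)

```shell
run() { echo "would run: $*"; }

# Reconstruct: read the whole surviving disk and write every block to
# the replacement -- a full sequential read of wd0 plus a full write of wd1.
run raidctl -F component0 raid0

# Parity rewrite on a mirror: read both components and compare, writing
# only where they disagree -- mostly reads from both disks.
run raidctl -P raid0
```
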

>  I don't get any errors from the disk drivers saying they
> can't read or write a drive, and reading and writing both drives is working
> fine in other contexts.  So, I think there's something wrong with the good
> mirror, but I'm not sure what.
> Any thoughts?

Oh.. hmmmmmmm.... 

requoting:
>    Queue size: 100, blocksize: 512, numBlocks: 1953524992

I wonder with numBlocks that high if we're running into a 32-bit 
overflow issue somewhere....  Likely related to the number of 
stripes...  Hopefully your /var/log/messages has some additional 
information that might point us in the right direction..
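
(Some back-of-the-envelope arithmetic on that label, just to show the 
scale -- my own illustration, not a pointer at any specific kernel 
code; assumes 64-bit shell arithmetic, as in any modern /bin/sh:)

```shell
numBlocks=1953524992   # sectors, from the component label above
blocksize=512          # bytes per sector

# numBlocks itself still fits in a signed 32-bit int (2^31 = 2147483648):
echo "numBlocks fits in int32: $((numBlocks < 2147483648))"

# ...but any calculation that multiplies it (total bytes, or anything
# scaled by stripe width) blows well past 32 bits:
bytes=$((numBlocks * blocksize))
echo "total bytes: $bytes"          # ~1 TB, needs 40 bits to represent
echo "same product truncated to 32 bits: $((bytes & 0xFFFFFFFF))"
```
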

Later...

Greg Oster



