NetBSD-Users archive


Re: RaidFrame Raid-1 problem (can't ditch a failing disk)



On Sun, 28 Feb 2010 01:49:53 -0500
Louis Guillaume <louis%zabrico.com@localhost> wrote:

> On 2/26/10 8:39 AM, Greg Oster wrote:
> > On Fri, 26 Feb 2010 01:12:12 -0500
> > Louis Guillaume<louis%zabrico.com@localhost>  wrote:
> >
> >> Hi!
> >>
> >> I have a strange problem replacing a drive from a RAID-1 RaidFrame set.
> >> Here's some info:
> >>
> >> # uname -mrs
> >> NetBSD 5.0_STABLE i386
> >>
> >> # raidctl -s raid0
> >> Components:
> >>              /dev/sd0a: failed
> >>              /dev/sd1a: optimal
> >> No spares.
> >> /dev/sd0a status is: failed.  Skipping label.
> >> Component label for /dev/sd1a:
> >>      Row: 0, Column: 1, Num Rows: 1, Num Columns: 2
> >>      Version: 2, Serial Number: 20071216, Mod Counter: 280
> >>      Clean: No, Status: 0
> >>      sectPerSU: 128, SUsPerPU: 1, SUsPerRU: 1
> >>      Queue size: 100, blocksize: 512, numBlocks: 143638784
> >>      RAID Level: 1
> >>      Autoconfig: Yes
> >>      Root partition: Yes
> >>      Last configured as: raid0
> >> Parity status: DIRTY
> >> Reconstruction is 100% complete.
> >> Parity Re-write is 100% complete.
> >> Copyback is 100% complete.
> >>
> >> # dmesg | grep sd0
> >> sd0 at scsibus0 target 0 lun 0:<ModusLnk, ,>  disk fixed
> >> sd0: 70136 MB, 78753 cyl, 2 head, 911 sec, 512 bytes/sect x 143638992 sectors
> >> sd0: sync (12.50ns offset 62), 16-bit (160.000MB/s) transfers, tagged queueing
> >> raid0: Components: /dev/sd0a[**FAILED**] /dev/sd1a
> >>
> >> # grep smartd.*sd0d /var/log/messages |tail -3
> >> Feb 26 00:43:04 thoth smartd[296]: Device: /dev/sd0d, opened
> >> Feb 26 00:43:04 thoth smartd[296]: Device: /dev/sd0d, is SMART capable. Adding to "monitor" list.
> >> Feb 26 00:43:04 thoth smartd[296]: Device: /dev/sd0d, SMART Failure: HARDWARE IMPENDING FAILURE TOO MANY BLOCK REASSIGNS
> >>
> >>
> >>
> >> So we got a bad disk and I have to change it out. So I did the following:
> 
> >> Any help would be great!
> >
> > Since what you're doing seems to be correct, I think we're going to need
> > a photo or backtrace or whatever of the panic in order to figure out
> > what's gone wrong :(
> >
> > Later...
> >
> > Greg Oster
> 
> 
> Ok - I was afraid of that. The problem is 100% reproducible, though, so 
> it's easy to do. Here are the screenshots:
> 
> ftp://zabrico.com/pub/RaidFrame-Panic-0.jpeg
> ftp://zabrico.com/pub/RaidFrame-Panic-1.jpeg
> 
> In this case, I had removed the failing drive, so we have sd0 on 
> scsibus1. This drive normally shows up as sd1 on scsibus1, but IIRC that 
> doesn't matter to RaidFrame, right? 

Nope.

> At any rate, the same thing happens 
> with a new blank (identical) disk in scsibus0.

The issue is that the rf_parity_map.c bits aren't checking to see if a
component is valid before attempting to get (and then use!) a component
label.  I'll see if I can whip up a patch for you to test, if Jed
doesn't beat me to it...  (I missed this issue when I was looking over
the paritymap stuff... :( )
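
A minimal sketch of the kind of check described above, assuming
hypothetical stand-in types (`struct component`, `struct comp_label`,
`for_each_valid_label`, `bump_mod_counter`); none of these are the
actual RAIDframe structures or the actual patch. It only illustrates
skipping a failed component before its label is fetched and used:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the real RAIDframe types. */
enum comp_status { COMP_OPTIMAL, COMP_FAILED };

struct comp_label {
    int mod_counter;
};

struct component {
    const char *name;
    enum comp_status status;
    struct comp_label *label;   /* may be NULL for a failed/absent disk */
};

/*
 * Apply fn to the label of every usable component and return how many
 * labels were visited.  Failed components (or ones with no label) are
 * skipped instead of being dereferenced.
 */
static int
for_each_valid_label(struct component *comps, size_t ncomps,
    void (*fn)(struct comp_label *))
{
    int visited = 0;

    assert(fn != NULL);
    for (size_t i = 0; i < ncomps; i++) {
        /* The validity check comes first, before touching the label. */
        if (comps[i].status != COMP_OPTIMAL || comps[i].label == NULL)
            continue;
        fn(comps[i].label);
        visited++;
    }
    return visited;
}

/* Example per-label operation (hypothetical): bump the mod counter. */
static void
bump_mod_counter(struct comp_label *label)
{
    label->mod_counter++;
}
```

With a two-component set shaped like the one in the raidctl output
(sd0a failed, sd1a optimal), only the sd1a label would be visited.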

This will also be a critical patch to get pulled up for NetBSD 5.1...

Later...

Greg Oster

