kern/42904: RaidFrame panic after removal of RAID-1 member
>Number: 42904
>Category: kern
>Synopsis: after removal of a failing RaidFrame RAID-1 member, NetBSD panics
>Confidential: no
>Severity: serious
>Priority: high
>Responsible: kern-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Sun Feb 28 20:40:00 +0000 2010
>Originator: Louis Guillaume
>Release: NetBSD 5.0_STABLE
>Organization:
>Environment:
System: NetBSD xxx.xxx.xxx 5.0_STABLE NetBSD 5.0_STABLE (GENERIC) #13: Wed Dec 30 14:39:00 EST 2009 louis%xx.xx.xxx@localhost:/usr/obj/sys/arch/i386/compile/GENERIC i386
Architecture: i386
Machine: i386
>Description:
First some background on our setup...
# raidctl -s raid0
Components:
/dev/sd0a: failed
/dev/sd1a: optimal
No spares.
/dev/sd0a status is: failed. Skipping label.
Component label for /dev/sd1a:
Row: 0, Column: 1, Num Rows: 1, Num Columns: 2
Version: 2, Serial Number: 20071216, Mod Counter: 280
Clean: No, Status: 0
sectPerSU: 128, SUsPerPU: 1, SUsPerRU: 1
Queue size: 100, blocksize: 512, numBlocks: 143638784
RAID Level: 1
Autoconfig: Yes
Root partition: Yes
Last configured as: raid0
Parity status: DIRTY
Reconstruction is 100% complete.
Parity Re-write is 100% complete.
Copyback is 100% complete.
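(Aside: the DIRTY parity status above is expected while a component is failed;
once the set is whole again, the rewrite is just the stock raidctl(8)
invocation:)
# raidctl -P raid0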
# dmesg | grep sd0
sd0 at scsibus0 target 0 lun 0: <ModusLnk, , > disk fixed
sd0: 70136 MB, 78753 cyl, 2 head, 911 sec, 512 bytes/sect x 143638992 sectors
sd0: sync (12.50ns offset 62), 16-bit (160.000MB/s) transfers, tagged queueing
raid0: Components: /dev/sd0a[**FAILED**] /dev/sd1a
# grep smartd.*sd0d /var/log/messages |tail -3
Feb 26 00:43:04 thoth smartd[296]: Device: /dev/sd0d, opened
Feb 26 00:43:04 thoth smartd[296]: Device: /dev/sd0d, is SMART capable. Adding to "monitor" list.
Feb 26 00:43:04 thoth smartd[296]: Device: /dev/sd0d, SMART Failure: HARDWARE IMPENDING FAILURE TOO MANY BLOCK REASSIGNS
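(For context, smartd was watching the raw disk devices. A smartd.conf entry
along these lines -- illustrative, not our exact config -- is enough to
produce the messages above:)
# /etc/smartd.conf: monitor all SMART attributes, mail root on failure
/dev/sd0d -a -m root
/dev/sd1d -a -m root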
So we have a bad disk that needs to be swapped out. I did the following:
o failed the component with "raidctl -f /dev/sd0a raid0"
o shut down
o replaced the disk
o rebooted
o The system now panics right after RaidFrame initializes.
Screen shots can be found at...
ftp://zabrico.com/pub/RaidFrame-Panic-0.jpeg
ftp://zabrico.com/pub/RaidFrame-Panic-1.jpeg
In this case, I had removed the failing drive, so we have sd0 on
scsibus1. That drive normally shows up as sd1 on scsibus1, but that
shouldn't matter to RaidFrame. At any rate, the same thing happens
with a new, blank (identical) disk on scsibus0.
To recover:
o power off
o put the "bad" sd0 back in
o machine boots as normal (the rebuild I'd do next is sketched below)
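For reference, the rebuild I was expecting to do once a working sd0 is in
place is the standard raidctl(8) in-place reconstruction (device names as in
our setup; the disklabel step assumes the new disk takes the same label as
the surviving member):
(copy the survivor's disklabel onto the new disk)
# disklabel sd1 > /tmp/sd1.label
# disklabel -R -r sd0 /tmp/sd1.label
(fail and rebuild in place onto the replaced component)
# raidctl -R /dev/sd0a raid0
# raidctl -s raid0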
>How-To-Repeat:
Not sure if this will be repeatable on other RAIDframe machines, but here's
what causes it to happen:
o Set up a RAID-1 device (a minimal configuration sketch follows this list)
o Fail one component with "raidctl -f /dev/xx0a raid0"
o shut down
o remove the failed component
o start up
o system panics right after "Kernelized RaidFrame activated".
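For step one, here's a minimal RAID-1 configuration sketch (disk names and
config path are hypothetical, and the a-partitions are assumed to already be
labeled with fstype RAID; the layout and queue values mirror the component
label shown above):
# cat > /tmp/raid0.conf <<EOF
START array
1 2 0
START disks
/dev/wd0a
/dev/wd1a
START layout
128 1 1 1
START queue
fifo 100
EOF
# raidctl -C /tmp/raid0.conf raid0
# raidctl -I 20100228 raid0
# raidctl -iv raid0
# raidctl -A yes raid0
(-A yes marks the set for autoconfiguration, which matters here since the
panic happens during autoconfig at boot.)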
>Fix:
See Greg Oster's analysis in this thread...
http://mail-index.netbsd.org/netbsd-users/2010/02/26/msg005746.html
Not sure if the actual fix is there, but...