NetBSD-Bugs archive


kern/38856: Raidframe reconstruction locks my machine up



>Number:         38856
>Category:       kern
>Synopsis:       Raidframe reconstruction locks my machine up
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Wed Jun 04 13:45:00 +0000 2008
>Originator:     Martin Husemann
>Release:        NetBSD 4.99.64
>Organization:
The NetBSD Foundation, Inc.
>Environment:
System: NetBSD night-porter.duskware.de 4.99.64 NetBSD 4.99.64 (PORTER) #39: Wed Jun 4 08:56:53 CEST 2008 martin%night-porter.duskware.de@localhost:/usr/src/sys/arch/i386/compile/PORTER i386
Architecture: i386
Machine: i386
>Description:

After a disk failure on a raid1 volume I decided to get more space and bought
two disks instead of one. I created a new raid set on one of the new drives,
copied everything over, replaced the half-raidset disk with the other new one
and now am stuck with reconstructing the set to the empty new disk.

Soon (sometimes immediately) after I start reconstruction, the machine locks
up completely; I can't even break into ddb on the console. This happens
reliably within less than 10 minutes on the affected raid set. The other raid
(consisting of slightly slower disks, wd0 and wd1, see dmesg below) can
be reconstructed. I saw the lockup there too once, but can't reproduce it.

Since this started happening after a hardware change, of course I initially
suspected hardware. So I have replaced the PSU and tested the affected disks
(wd2 and wd3) as hard as I could, but I am unable to cause any problem w/o
the raid reconstruction.

After talking to Greg Oster I downgraded the raidframe code to May 18 (before
the recent reconstruction changes), but the problem happens with the old
code too. This seems timing related, as a DEBUG/LOCKDEBUG/DIAGNOSTIC kernel
seems to take longer to lock up.

Here are the raid params:

Components:
           /dev/wd2a: optimal
          component1: failed
No spares.
Component label for /dev/wd2a:
   Row: 0, Column: 0, Num Rows: 1, Num Columns: 2
   Version: 2, Serial Number: 1166836002, Mod Counter: 112
   Clean: No, Status: 0
   sectPerSU: 128, SUsPerPU: 1, SUsPerRU: 1
   Queue size: 100, blocksize: 512, numBlocks: 490350592
   RAID Level: 1
   Autoconfig: Yes
   Root partition: No
   Last configured as: raid1
component1 status is: failed.  Skipping label.
Parity status: DIRTY
Reconstruction is 100% complete.
Parity Re-write is 100% complete.
Copyback is 100% complete.

and here is full dmesg:

Copyright (c) 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005,
    2006, 2007, 2008
    The NetBSD Foundation, Inc.  All rights reserved.
Copyright (c) 1982, 1986, 1989, 1991, 1993
    The Regents of the University of California.  All rights reserved.

NetBSD 4.99.64 (PORTER) #39: Wed Jun  4 08:56:53 CEST 2008
        martin%night-porter.duskware.de@localhost:/usr/src/sys/arch/i386/compile/PORTER
total memory = 3071 MB
avail memory = 3009 MB
timecounter: Timecounters tick every 10.000 msec
timecounter: Timecounter "i8254" frequency 1193182 Hz quality 100
System Manufacturer System Name (System Version)
mainbus0 (root)
cpu0 at mainbus0 apid 3: Intel 686-class, 1000MHz, id 0x686
cpu1 at mainbus0 apid 0: Intel 686-class, 1000MHz, id 0x686
ioapic0 at mainbus0 apid 2: pa 0xfec00000, version 11, 16 pins
ioapic1 at mainbus0 apid 3: pa 0xfec01000, version 11, 16 pins
acpi0 at mainbus0: Intel ACPICA 20080321
acpi0: X/RSDT: OemId <ASUS  ,CUR-DLS ,30303031>, AslId <MSFT,31313031>
LNKR: ACPI: Found matching pin for 0.15.INTA at func 2: 9
acpi0: SCI interrupting at int 9
acpi0: fixed-feature power button present
timecounter: Timecounter "ACPI-Safe" frequency 3579545 Hz quality 900
ACPI-Safe 32-bit timer
PWRB (PNP0C0C) at acpi0 not configured
FPU (PNP0C04) at acpi0 not configured
TMR (PNP0100) at acpi0 not configured
SPKR (PNP0800) at acpi0 not configured
KBC (PNP0303) at acpi0 not configured
MOUE (PNP0F13) at acpi0 not configured
COMA (PNP0501) at acpi0 not configured
COMB (PNP0501) at acpi0 not configured
FDC (PNP0700) at acpi0 not configured
pci0 at mainbus0 bus 0: configuration mode 1
pci0: i/o space, memory space enabled, rd/line, rd/mult, wr/inv ok
pchb0 at pci0 dev 0 function 0
pchb0: vendor 0x1166 product 0x0009 (rev. 0x05)
pchb1 at pci0 dev 0 function 1
pchb1: vendor 0x1166 product 0x0009 (rev. 0x05)
pci1 at pchb1 bus 1
pci1: i/o space, memory space enabled, rd/line, rd/mult, wr/inv ok
wm0 at pci1 dev 2 function 0: Intel i82543GC 1000BASE-T Ethernet, rev. 2
wm0: interrupting at ioapic1 pin 5
wm0: 64-bit 33MHz PCI bus
wm0: 64 word (6 address bits) MicroWire EEPROM
wm0: Ethernet address 00:03:47:25:42:14
makphy0 at wm0 phy 1: Marvell 88E1000 Gigabit PHY, rev. 2
makphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 1000baseT-FDX, auto
puc0 at pci1 dev 4 function 0: Titan PCI-800L (com, com, com, com, com, com, com, com)
com2 at puc0 port 0: interrupting at ioapic1 pin 7
com2: ns16550a, working fifo
com3 at puc0 port 1: interrupting at ioapic1 pin 7
com3: ns16550a, working fifo
com4 at puc0 port 2: interrupting at ioapic1 pin 7
com4: ns16550a, working fifo
com5 at puc0 port 3: interrupting at ioapic1 pin 7
com5: ns16550a, working fifo
com6 at puc0 port 4: interrupting at ioapic1 pin 7
com6: ns16550a, working fifo
com7 at puc0 port 5: interrupting at ioapic1 pin 7
com7: ns16550a, working fifo
com8 at puc0 port 6: interrupting at ioapic1 pin 7
com8: ns16550a, working fifo
com9 at puc0 port 7: interrupting at ioapic1 pin 7
com9: ns16550a, working fifo
vendor 0x1000 product 0x0020 (SCSI mass storage, revision 0x01) at pci1 dev 5 function 0 not configured
vendor 0x1000 product 0x0020 (SCSI mass storage, revision 0x01) at pci1 dev 5 function 1 not configured
fxp0 at pci0 dev 2 function 0: i82559 Ethernet, rev 8
fxp0: interrupting at ioapic1 pin 4
fxp0: Ethernet address 00:e0:18:04:e1:cd
inphy0 at fxp0 phy 1: i82555 10/100 media interface, rev. 4
inphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
satalink0 at pci0 dev 6 function 0
satalink0: Silicon Image SATALink 3114 (rev. 0x02)
satalink0: 33MHz PCI bus
satalink0: bus-master DMA support present
satalink0: using ioapic1 pin 3 for native-PCI interrupt
atabus0 at satalink0 channel 0
atabus1 at satalink0 channel 1
atabus2 at satalink0 channel 2
atabus3 at satalink0 channel 3
vga0 at pci0 dev 7 function 0: vendor 0x1002 product 0x4752 (rev. 0x27)
wsdisplay0 at vga0 kbdmux 1: console (80x25, vt100 emulation)
wsmux1: connecting to wsdisplay0
drm at vga0 not configured
piixpm0 at pci0 dev 15 function 0
piixpm0: vendor 0x1166 product 0x0200 (rev. 0x50)

iic0 at piixpm0: I2C bus
rccide0 at pci0 dev 15 function 1
rccide0: ServerWorks OSB4 IDE Controller (rev. 0x00)
rccide0: bus-master DMA support present
rccide0: primary channel configured to compatibility mode
rccide0: primary channel interrupting at ioapic0 pin 14
atabus4 at rccide0 channel 0
rccide0: secondary channel configured to compatibility mode
rccide0: secondary channel interrupting at ioapic0 pin 15
atabus5 at rccide0 channel 1
ohci0 at pci0 dev 15 function 2: vendor 0x1166 product 0x0220 (rev. 0x04)
ohci0: interrupting at ioapic0 pin 9
ohci0: OHCI version 1.0, legacy support
usb0 at ohci0: USB revision 1.0
isa0 at mainbus0
com0 at isa0 port 0x3f8-0x3ff irq 4: ns16550a, working fifo
com1 at isa0 port 0x2f8-0x2ff irq 3: ns16550a, working fifo
pckbc0 at isa0 port 0x60-0x64
pckbd0 at pckbc0 (kbd slot)
pckbc0: using irq 1 for kbd slot
wskbd0 at pckbd0: console keyboard, using wsdisplay0
attimer0 at isa0 port 0x40-0x43: AT Timer
npx0 at isa0 port 0xf0-0xff
npx0: reported by CPUID; using exception 16
fdc0 at isa0 port 0x3f0-0x3f7 irq 6 drq 2
timecounter: Timecounter "clockinterrupt" frequency 100 Hz quality 0
satalink0: port 0: device present, speed: 1.5Gb/s
wd0 at atabus0 drive 0satalink0: port 1: device present, speed: 1.5Gb/s
satalink0: port 2: device present, speed: 1.5Gb/s
satalink0: port 3: device present, speed: 1.5Gb/s
: <WDC WD2000JD-00GBB0>
wd0: drive supports 16-sector PIO transfers, LBA48 addressing
wd0: 186 GB, 387621 cyl, 16 head, 63 sec, 512 bytes/sect x 390721968 sectors
fd0 at fdc0 drive 0: 1.44MB, 80 cyl, 2 head, 18 sec
wd0: 32-bit data port
wd0: drive supports PIO mode 4uhub0 at usb0: vendor 0x1166 OHCI root hub, class 
9/0, rev 1.00/1.00, addr 1
uhub0: 4 ports with 4 removable, self powered
, DMA mode 2, Ultra-DMA mode 5 (Ultra/100)
wd0(satalink0:0:0): using PIO mode 4, Ultra-DMA mode 5 (Ultra/100) (using DMA)
wd1 at atabus1 drive 0: <WDC WD2000JD-00GBB0>
wd1: drive supports 16-sector PIO transfers, LBA48 addressing
wd1: 186 GB, 387621 cyl, 16 head, 63 sec, 512 bytes/sect x 390721968 sectors
wd1: 32-bit data port
wd1: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 5 (Ultra/100)
wd1(satalink0:1:0): using PIO mode 4, Ultra-DMA mode 5 (Ultra/100) (using DMA)
wd2 at atabus2 drive 0: <WDC WD2502ABYS-01B7A0>
wd2: drive supports 16-sector PIO transfers, LBA48 addressing
wd2: 233 GB, 486459 cyl, 16 head, 63 sec, 512 bytes/sect x 490350672 sectors
wd2: 32-bit data port
wd2: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133)
wd2(satalink0:2:0): using PIO mode 4, Ultra-DMA mode 6 (Ultra/133) (using DMA)
wd3 at atabus3 drive 0: <WDC WD2502ABYS-01B7A0>
wd3: drive supports 16-sector PIO transfers, LBA48 addressing
wd3: 233 GB, 486459 cyl, 16 head, 63 sec, 512 bytes/sect x 490350672 sectors
wd3: 32-bit data port
wd3: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133)
wd3(satalink0:3:0): using PIO mode 4, Ultra-DMA mode 6 (Ultra/133) (using DMA)
atapibus0 at atabus4: 2 targets
cd0 at atapibus0 drive 0: <HL-DT-ST DVDRAM GSA-4040B, K2I38IJ5950, A300> cdrom removable
cd0: 32-bit data port
cd0: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 2 (Ultra/33)
cd0(rccide0:0:0): using PIO mode 4, DMA mode 2, Ultra-DMA mode 2 (Ultra/33) (using DMA)
uhub1 at uhub0 port 1: vendor 0x0451 product 0x1446, class 9/0, rev 1.10/1.10, addr 2
uhub1: 4 ports with 4 removable, self powered
uplcom0 at uhub1 port 1
uplcom0: Prolific Technology Inc. USB-Serial Controller, rev 1.10/3.00, addr 3
ucom0 at uplcom0
uplcom1 at uhub1 port 2
uplcom1: Prolific Technology Inc. USB-Serial Controller, rev 1.10/3.00, addr 4
ucom1 at uplcom1
Kernelized RAIDframe activated
raid0: RAID Level 1
raid0: Components: /dev/wd0a /dev/wd1a
raid0: Total Sectors: 389671552 (190269 MB)
raid1: RAID Level 1
raid1: Components: /dev/wd2a component1[**FAILED**]
raid1: Total Sectors: 490350592 (239429 MB)
boot device: raid0
root on raid0a dumps on raid0b
root file system type: ffs
wsdisplay0: screen 1 added (80x25, vt100 emulation)
wsdisplay0: screen 2 added (80x25, vt100 emulation)
wsdisplay0: screen 3 added (80x25, vt100 emulation)
wsdisplay0: screen 4 added (80x25, vt100 emulation)
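
The sizes reported above can be cross-checked with a little arithmetic. This is
only a minimal sketch: it assumes the MB figure RAIDframe prints is the sector
count times the 512-byte block size, divided by 2^20 and truncated. The
function and variable names are made up for illustration; the numbers are
copied from the component label and the dmesg.

```python
# Sanity check of the sizes quoted in this report.
# Assumption: "MB" in the RAIDframe dmesg lines means 2**20 bytes, truncated.

SECTOR_BYTES = 512  # "blocksize: 512" from the component label


def sectors_to_mb(sectors, sector_bytes=SECTOR_BYTES):
    """Convert a sector count to whole megabytes (2**20 bytes)."""
    return sectors * sector_bytes // (1024 * 1024)


raid1_sectors = 490350592  # numBlocks in the label / raid1 Total Sectors
raid0_sectors = 389671552  # raid0 Total Sectors
wd2_sectors = 490350672    # raw size of wd2 (and wd3) from dmesg
sect_per_su = 128          # sectPerSU from the component label

raid1_mb = sectors_to_mb(raid1_sectors)
raid0_mb = sectors_to_mb(raid0_sectors)
stripe_unit_kib = sect_per_su * SECTOR_BYTES // 1024
# Sectors on the raw disk that are not part of the raid set's data area.
outside_set = wd2_sectors - raid1_sectors

print("raid1:", raid1_mb, "MB")
print("raid0:", raid0_mb, "MB")
print("stripe unit:", stripe_unit_kib, "KiB")
print("wd2 sectors outside the set:", outside_set)
```

Under that assumption the label and the dmesg agree with each other, so the
label on wd2a at least looks self-consistent.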


>How-To-Repeat:
See above.

>Fix:
n/a


