Subject: Re: raidframe re-mirroring (cont'd)
To: Greg Oster <oster@cs.usask.ca>
From: Louis Guillaume <lguillaume@berklee.edu>
List: current-users
Date: 08/13/2004 14:05:02
Greg Oster wrote:

> Louis Guillaume writes:
> 
>>Hi Everyone,
>>
>>I posted a few weeks ago about a problem I had with a raid set, where 
>>one disk was failed and I wanted to bring it back online. Here's what 
>>happened...
>>
>>. Booted into single-user
>>
>>. Rebuilt all arrays on the pair of disks: raid0 raid1 raid2 raid3 raid4 
>>- all raid-1. It's set up like this...
>>
>>#############################
>>raid0 raid1 raid2 raid3 raid4
>>
>>wd0a  wd0e  wd0f  wd0g  wd0b
>>wd1a  wd1e  wd1f  wd1g  wd1b
>>
>>/     /usr  /var  /home swap
>>#############################
>>
>>. fsck-ed all filesystems. reboot
>>
>>Immediately, I noticed apache2 and spamass-milter fail during startup 
>>(recently built from pkgsrc and very reliable). Immediately!
> 
> 
> How do they fail?  What do they do/not do? (i.e. what is the nature 
> of the error?)
> 

They just die with segmentation faults and dump core when their rc 
scripts run. The behavior isn't consistent, however: sometimes apache 
won't dump its core but still won't start.

> 
>>This is 
>>what caused me to believe the second disk was bad in the first place.
>>
>>Now I believed that the disk was actually bad and not the kernel/raidframe.
>>
>>. Rebooted back to single user.
>>. Failed all wd1 raid components.
>>. fsck (finds and fixes errors) and reboot again.
>>
>>All is well! For a week and a half, not a hitch.
>>
>>More reason to believe it's the disk.
>>
>>. Replaced suspect disk with another one, disklabeled, raidctl -a ...etc.
>>
>>. Incorporated new spare components into arrays.
>>
>>. rebooted. raidctl -F ... , fsck , reboot.
>>
>>SAME FAILURES as before!! Apache2 and spamass-milter are the first to 
>>go. In the past I had not noticed these right away and kept running.
>>
>>This is very strange. I'd really like to get my redundancy back. But 
>>once again, I'm running on a set of single-component raid-1 arrays.
>>
>>Here is some other information that may be useful...
>>
>>Machine - i386
>>Problem first noticed at NetBSD-2.0E GENERIC.MP kernel
>>Still a problem at NetBSD-2.0G GENERIC.MP kernel
>>
>>I'm guessing my disk is good. The machine runs great on one disk. Weeks 
>>of uptime - even months without a peep. So I'm not thinking that there's 
>>a memory problem as someone suggested earlier.
>>
>>The only other thing I can think of is perhaps the ribbon cable from the 
>>board to the disk. But if that was bad, wouldn't we have much more 
>>obvious issues?
>>
>>I don't know if this is a config problem, or something else. But there 
>>definitely is a strange problem that's preventing me from mirroring 
>>successfully.
>>
>>Perhaps too many raid devices on one pair of disks?
> 
> 
> No.
> 
> 
>>Maybe problems with MP kernel and raidframe?
> 
> 
> Not supposed to be.  I haven't seen anything here that would suggest 
> that... 
>  
> 
>>Any help would be great. Please let me know if I can provide more 
>>information.
> 
> 
> The apache/milter errors would be useful.  RAID config files and a 
> 'dmesg' output would also help.
> 

The processes die so quickly that there don't seem to be any printed or 
logged errors. I'm not sure where to look for this stuff.

Here's a typical raid config used to make each array...

$ cat /etc/raid0.conf
START array
1 2 0

START disks
/dev/wd0a
/dev/wd9a

START layout
128 1 1 1

START queue
fifo 100

... naturally wd9x was replaced with wd1x
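
For anyone reading along who isn't familiar with the config format, here is 
a minimal Python sketch (not part of raidctl) that splits such a file into its 
START sections and names the fields. It assumes the field order documented 
in raidctl(8): numRow numCol numSpare for "array", and sectPerSU 
SUsPerParityUnit SUsPerReconUnit RAID_level for "layout". The function 
names parse_raid_conf and describe are made up for this illustration, and 
the sample uses wd1a in place of the wd9a placeholder.

```python
def parse_raid_conf(text):
    """Split a RAIDframe config file into {section_name: [lines]}."""
    sections = {}
    current = None
    for raw in text.splitlines():
        # Drop trailing comments and skip blank lines.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        if line.startswith("START "):
            current = line.split(None, 1)[1]
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return sections


def describe(sections):
    """Name the numeric fields (field order assumed from raidctl(8))."""
    rows, cols, spares = (int(x) for x in sections["array"][0].split())
    sect_per_su, _su_per_pu, _su_per_ru, level = sections["layout"][0].split()
    return {
        "rows": rows,
        "columns": cols,
        "spares": spares,
        "components": sections["disks"],
        "sectors_per_stripe_unit": int(sect_per_su),
        "raid_level": level,
    }


CONF = """START array
1 2 0

START disks
/dev/wd0a
/dev/wd1a

START layout
128 1 1 1

START queue
fifo 100
"""

info = describe(parse_raid_conf(CONF))
print(info)
```

Read this way, the file above describes one row of two components with no 
spares, 128 sectors per stripe unit, at RAID level 1.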


Here's dmesg...

NetBSD 2.0G (GENERIC.MP) #7: Fri Jul  2 18:26:58 EDT 2004
         louis@shodo.berklee.net:/usr/obj/sys/arch/i386/compile/GENERIC.MP
total memory = 255 MB
avail memory = 242 MB
BIOS32 rev. 0 found at 0xfdba0
mainbus0 (root)
mainbus0: Intel MP Specification (Version 1.4) (AMI      CNB30LE     )
cpu0 at mainbus0: apid 0 (boot processor)
cpu0: Intel Pentium III (686-class), 996.90 MHz, id 0x68a
cpu0: features 387fbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR>
cpu0: features 387fbff<PGE,MCA,CMOV,PAT,PSE36,PN,MMX>
cpu0: features 387fbff<FXSR,SSE>
cpu0: I-cache 16 KB 32b/line 4-way, D-cache 16 KB 32b/line 4-way
cpu0: L2 cache 256 KB 32b/line 8-way
cpu0: ITLB 32 4 KB entries 4-way, 2 4 MB entries fully associative
cpu0: DTLB 64 4 KB entries 4-way, 8 4 MB entries 4-way
cpu0: serial number 0000-068A-0002-67F6-059B-2626
cpu0: calibrating local timer
cpu0: apic clock running at 132 MHz
cpu0: 8 page colors
cpu1 at mainbus0: apid 1 (application processor)
cpu1: starting
cpu1: Intel Pentium III (686-class), 996.84 MHz, id 0x68a
cpu1: features 387fbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR>
cpu1: features 387fbff<PGE,MCA,CMOV,PAT,PSE36,PN,MMX>
cpu1: features 387fbff<FXSR,SSE>
cpu1: I-cache 16 KB 32b/line 4-way, D-cache 16 KB 32b/line 4-way
cpu1: L2 cache 256 KB 32b/line 8-way
cpu1: ITLB 32 4 KB entries 4-way, 2 4 MB entries fully associative
cpu1: DTLB 64 4 KB entries 4-way, 8 4 MB entries 4-way
cpu1: serial number 0000-068A-0002-9F6F-3E69-3D80
mpbios: bus 0 is type PCI
mpbios: bus 1 is type PCI
mpbios: bus 2 is type ISA
ioapic0 at mainbus0 apid 4 (I/O APIC)
ioapic0: pa 0xfec00000, version 11, 16 pins
ioapic1 at mainbus0 apid 5 (I/O APIC)
ioapic1: pa 0xfec01000, version 11, 16 pins
pci0 at mainbus0 bus 0: configuration mode 1
pci0: i/o space, memory space enabled, rd/line, rd/mult, wr/inv ok
pchb0 at pci0 dev 0 function 0
pchb0: ServerWorks CNB20LE Host (rev. 0x06)
pchb1 at pci0 dev 0 function 1
pchb1: ServerWorks CNB20LE Host (rev. 0x06)
pci1 at pchb1 bus 1
pci1: i/o space, memory space enabled, rd/line, rd/mult, wr/inv ok
adv1 at pci1 dev 2 function 0: AdvanSys ABP-9xxUA SCSI adapter
adv1: interrupting at ioapic1 pin 11 (irq 10)
scsibus0 at adv1: 8 targets, 8 luns per target
vga1 at pci0 dev 1 function 0: ATI Technologies Rage XL (rev. 0x27)
wsdisplay0 at vga1 kbdmux 1: console (80x25, vt100 emulation)
wsmux1: connecting to wsdisplay0
fxp0 at pci0 dev 4 function 0: i82559 Ethernet, rev 8
fxp0: interrupting at ioapic1 pin 4 (irq 9)
fxp0: Ethernet address 00:e0:81:04:0f:7e
inphy0 at fxp0 phy 1: i82555 10/100 media interface, rev. 4
inphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
fxp1 at pci0 dev 5 function 0: i82559 Ethernet, rev 8
fxp1: interrupting at ioapic1 pin 5 (irq 5)
fxp1: Ethernet address 00:e0:81:04:0f:7f
inphy1 at fxp1 phy 1: i82555 10/100 media interface, rev. 4
inphy1: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
pcib0 at pci0 dev 15 function 0
pcib0: ServerWorks OSB4 SouthBridge (rev. 0x50)
rccide0 at pci0 dev 15 function 1
rccide0: ServerWorks OSB4 IDE Controller (rev. 0x00)
rccide0: bus-master DMA support present
rccide0: primary channel configured to compatibility mode
rccide0: primary channel interrupting at ioapic0 pin 14 (irq 14)
atabus0 at rccide0 channel 0
rccide0: secondary channel configured to compatibility mode
rccide0: secondary channel interrupting at ioapic0 pin 15 (irq 15)
atabus1 at rccide0 channel 1
ohci0 at pci0 dev 15 function 2: ServerWorks OSB4/CSB5 USB Host Controller (rev. 0x04)
ohci0: interrupting at ioapic0 pin 10 (irq 10)
ohci0: OHCI version 1.0, legacy support
usb0 at ohci0: USB revision 1.0
uhub0 at usb0
uhub0: ServerWorks OHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub0: 4 ports with 4 removable, self powered
isa0 at pcib0
com0 at isa0 port 0x3f8-0x3ff irq 4: ns16550a, working fifo
com1 at isa0 port 0x2f8-0x2ff irq 3: ns16550a, working fifo
pckbc0 at isa0 port 0x60-0x64
pckbd0 at pckbc0 (kbd slot)
pckbc0: using irq 1 for kbd slot
wskbd0 at pckbd0: console keyboard, using wsdisplay0
pcppi0 at isa0 port 0x61
midi0 at pcppi0: PC speaker
sysbeep0 at pcppi0
isapnp0 at isa0 port 0x279: ISA Plug 'n Play device support
npx0 at isa0 port 0xf0-0xff: using exception 16
fdc0 at isa0 port 0x3f0-0x3f7 irq 6 drq 2
isapnp0: no ISA Plug 'n Play devices found
ioapic1: enabling
ioapic0: enabling
fd0 at fdc0 drive 0: 1.44MB, 80 cyl, 2 head, 18 sec
Kernelized RAIDframe activated
scsibus0: waiting 2 seconds for devices to settle...
wd0 at atabus0 drive 0: <Maxtor 52049H4>
wd0: drive supports 16-sector PIO transfers, LBA addressing
wd0: 19541 MB, 39703 cyl, 16 head, 63 sec, 512 bytes/sect x 40020624 sectors
wd0: 32-bit data port
wd0: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 5 (Ultra/100)
wd0(rccide0:0:0): using PIO mode 4, DMA mode 2, Ultra-DMA mode 2 (Ultra/33) (using DMA data transfers)
wd1 at atabus1 drive 0: <Maxtor 52049H3>
wd1: drive supports 16-sector PIO transfers, LBA addressing
wd1: 19541 MB, 39704 cyl, 16 head, 63 sec, 512 bytes/sect x 40021632 sectors
wd1: 32-bit data port
wd1: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 5 (Ultra/100)
wd1(rccide0:1:0): using PIO mode 4, DMA mode 2, Ultra-DMA mode 2 (Ultra/33) (using DMA data transfers)
raid0: RAID Level 1
raid0: Components: /dev/wd0a /dev/wd1a[**FAILED**]
raid0: Total Sectors: 1056256 (515 MB)
raid4: RAID Level 1
raid4: Components: /dev/wd0b /dev/wd1b[**FAILED**]
raid4: Total Sectors: 1056256 (515 MB)
raid1: RAID Level 1
raid1: Components: /dev/wd0e /dev/wd1e[**FAILED**]
raid1: Total Sectors: 8450944 (4126 MB)
raid2: RAID Level 1
raid2: Components: /dev/wd0f /dev/wd1f[**FAILED**]
raid2: Total Sectors: 15845632 (7737 MB)
raid3: RAID Level 1
raid3: Components: /dev/wd0g /dev/wd1g[**FAILED**]
raid3: Total Sectors: 12692608 (6197 MB)
boot device: raid0
root on raid0a dumps on raid0b
root file system type: ffs
cpu1: CPU 1 running
raid0: Device already configured!
raid1: Device already configured!
raid2: Device already configured!
raid3: Device already configured!
wsdisplay0: screen 1 added (80x25, vt100 emulation)
wsdisplay0: screen 2 added (80x25, vt100 emulation)
wsdisplay0: screen 3 added (80x25, vt100 emulation)
wsdisplay0: screen 4 added (80x25, vt100 emulation)
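
To pull the failed components out of a long dmesg like the one above, 
something along these lines might help. It is only a sketch: it assumes the 
"raidN: Components: ..." line format and the [**FAILED**] marker exactly as 
they appear in this output, and failed_components is a hypothetical helper 
name, not an existing tool.

```python
import re

# Matches lines such as "raid0: Components: /dev/wd0a /dev/wd1a[**FAILED**]".
COMPONENT_LINE = re.compile(r"^(raid\d+): Components: (.*)$")
FAILED_MARK = "[**FAILED**]"


def failed_components(dmesg_text):
    """Return {raid_unit: [failed component paths]} found in dmesg text."""
    failed = {}
    for line in dmesg_text.splitlines():
        match = COMPONENT_LINE.match(line.strip())
        if not match:
            continue
        unit, rest = match.groups()
        # Keep only the components tagged as failed, minus the marker.
        bad = [tok[:-len(FAILED_MARK)] for tok in rest.split()
               if tok.endswith(FAILED_MARK)]
        if bad:
            failed[unit] = bad
    return failed


SAMPLE = """raid0: RAID Level 1
raid0: Components: /dev/wd0a /dev/wd1a[**FAILED**]
raid0: Total Sectors: 1056256 (515 MB)
raid4: RAID Level 1
raid4: Components: /dev/wd0b /dev/wd1b[**FAILED**]
raid4: Total Sectors: 1056256 (515 MB)
"""

result = failed_components(SAMPLE)
print(result)
```

Run against the full dmesg, this would list the wd1 component of each of the 
five arrays, which matches the single-component state described earlier.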


> Have you tried isolating which of the RAID sets seems to be causing 
> the problem?
> 

I'll have to try that this weekend. A few weeks ago I tried bringing the 
mirrors up one at a time and left things running for a while in 
between. raid0 seemed OK, but by the time I got to the last one I wasn't 
sure how long it would take for problems to show up. This is 
probably going to take several days to do.

Thanks!