Subject: Re: raidframe re-mirroring (cont'd)
To: Greg Oster <firstname.lastname@example.org>
From: Louis Guillaume <email@example.com>
Date: 08/13/2004 14:05:02
Greg Oster wrote:
> Louis Guillaume writes:
>>I posted a few weeks ago about a problem I had with a raid set, where
>>one disk had failed and I wanted to bring it back online. Here's what
>>I did:
>>. Booted into single-user
>>. Rebuilt all arrays on the pair of disks: raid0 raid1 raid2 raid3 raid4
>>- all raid-1. It's set up like this...
>>  raid0  raid1  raid2  raid3  raid4
>>  wd0a   wd0e   wd0f   wd0g   wd0b
>>  wd1a   wd1e   wd1f   wd1g   wd1b
>>  /      /usr   /var   /home  swap
>>. fsck-ed all filesystems. reboot
>>Immediately, I noticed apache2 and spamass-milter fail during startup
>>(recently built from pkgsrc and very reliable). Immediately!
> How do they fail? What do they do/not do? (i.e. what is the nature
> of the error?)
They just die from segmentation faults and dump core when their rc
scripts run. It's not consistent, however: for example, sometimes apache
won't dump its core but still won't start.
>>what caused me to believe the second disk was bad in the first place.
>>Now I believed that the disk was actually bad and not the kernel/raidframe.
>>. Rebooted back to single user.
>>. Failed all wd1 raid components.
>>. fsck (finds and fixes errors) and reboot again.
>>All is well! For a week and a half, not a hitch.
>>More reason to believe it's the disk.
>>. Replaced suspect disk with another one, disklabeled, raidctl -a ... etc.
>>. Incorporated new spare components into arrays.
>>. rebooted. raidctl -F ... , fsck , reboot.
>>SAME FAILURES as before!! Apache2 and spamass-milter are the first to
>>go. In the past I had not noticed these right away and kept running.
>>This is very strange. I'd really like to get my redundancy back. But
>>once again, I'm running on a set of single-component raid-1 arrays.
>>Here is some other information that may be useful...
>>Machine - i386
>>Problem first noticed at NetBSD-2.0E GENERIC.MP kernel
>>Still a problem at NetBSD-2.0G GENERIC.MP kernel
>>I'm guessing my disk is good. The machine runs great on one disk. Weeks
>>of uptime - even months without a peep. So I'm not thinking that there's
>>a memory problem as someone suggested earlier.
>>The only other thing I can think of is perhaps the ribbon cable from the
>>board to the disk. But if that was bad, wouldn't we have much more
>>I don't know if this is a config problem, or something else. But there
>>definitely is a strange problem that's preventing me from mirroring
>>Perhaps too many raid devices on one pair of disks?
>>Maybe problems with MP kernel and raidframe?
> Not supposed to be. I haven't seen anything here that would suggest
>>Any help would be great. Please let me know if I can provide more
> The apache/milter errors would be useful. RAID config files and a
> 'dmesg' output would also help.
The processes die so quickly that there don't seem to be any printed or
logged errors. Not sure where to look for this stuff.
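
One place to look, as a rough sketch: a process that segfaults normally
leaves a core file named after the program (e.g. httpd.core) in its
working directory, assuming the default "<program>.core" naming is still
in effect. The helper name find_core_files and the directories in the
usage comment are made up for illustration, not known locations on this
machine.

```python
import os


def find_core_files(roots, suffix=".core"):
    """Return (path, mtime) pairs for files ending in `suffix` under `roots`.

    Assumes the default "<program>.core" naming; adjust `suffix` otherwise.
    """
    found = []
    for root in roots:
        # os.walk silently yields nothing for a root that does not exist.
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                if name.endswith(suffix):
                    path = os.path.join(dirpath, name)
                    try:
                        found.append((path, os.path.getmtime(path)))
                    except OSError:
                        # File vanished or is unreadable; skip it.
                        continue
    # Newest first, so cores from the latest boot show up at the top.
    return sorted(found, key=lambda item: item[1], reverse=True)


# Example call (the roots are assumptions, pick whatever applies):
#   for path, mtime in find_core_files(["/var", "/usr/pkg"]):
#       print(mtime, path)
```

Any core this turns up could then be opened in a debugger together with
the matching binary to get a backtrace.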
Here's a typical raid config used to make each array...
$ cat /etc/raid0.conf
1 2 0
128 1 1 1
... naturally wd9x was replaced with wd1x
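
For reference, here is a sketch of the general shape such a file takes in
the section layout described in raidctl(8). Only the "1 2 0" and
"128 1 1 1" lines come from the file above. The component names and the
"fifo 100" queue line are assumptions filled in for illustration, and
render_raid1_conf is a hypothetical helper.

```python
def render_raid1_conf(first, second, queue="fifo 100"):
    """Render a two-component RAID 1 config in the raidctl(8) section layout.

    `queue` is an assumed value; it does not appear in the file shown above.
    """
    lines = [
        "START array",
        "1 2 0",        # numRow numCol numSpare, as shown above
        "",
        "START disks",
        first,
        second,
        "",
        "START layout",
        "128 1 1 1",    # sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level
        "",
        "START queue",
        queue,
    ]
    return "\n".join(lines) + "\n"


# Component names are assumed from the layout described earlier.
print(render_raid1_conf("/dev/wd0a", "/dev/wd1a"))
```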
NetBSD 2.0G (GENERIC.MP) #7: Fri Jul 2 18:26:58 EDT 2004
total memory = 255 MB
avail memory = 242 MB
BIOS32 rev. 0 found at 0xfdba0
mainbus0: Intel MP Specification (Version 1.4) (AMI CNB30LE )
cpu0 at mainbus0: apid 0 (boot processor)
cpu0: Intel Pentium III (686-class), 996.90 MHz, id 0x68a
cpu0: features 387fbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR>
cpu0: features 387fbff<PGE,MCA,CMOV,PAT,PSE36,PN,MMX>
cpu0: features 387fbff<FXSR,SSE>
cpu0: I-cache 16 KB 32b/line 4-way, D-cache 16 KB 32b/line 4-way
cpu0: L2 cache 256 KB 32b/line 8-way
cpu0: ITLB 32 4 KB entries 4-way, 2 4 MB entries fully associative
cpu0: DTLB 64 4 KB entries 4-way, 8 4 MB entries 4-way
cpu0: serial number 0000-068A-0002-67F6-059B-2626
cpu0: calibrating local timer
cpu0: apic clock running at 132 MHz
cpu0: 8 page colors
cpu1 at mainbus0: apid 1 (application processor)
cpu1: Intel Pentium III (686-class), 996.84 MHz, id 0x68a
cpu1: features 387fbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR>
cpu1: features 387fbff<PGE,MCA,CMOV,PAT,PSE36,PN,MMX>
cpu1: features 387fbff<FXSR,SSE>
cpu1: I-cache 16 KB 32b/line 4-way, D-cache 16 KB 32b/line 4-way
cpu1: L2 cache 256 KB 32b/line 8-way
cpu1: ITLB 32 4 KB entries 4-way, 2 4 MB entries fully associative
cpu1: DTLB 64 4 KB entries 4-way, 8 4 MB entries 4-way
cpu1: serial number 0000-068A-0002-9F6F-3E69-3D80
mpbios: bus 0 is type PCI
mpbios: bus 1 is type PCI
mpbios: bus 2 is type ISA
ioapic0 at mainbus0 apid 4 (I/O APIC)
ioapic0: pa 0xfec00000, version 11, 16 pins
ioapic1 at mainbus0 apid 5 (I/O APIC)
ioapic1: pa 0xfec01000, version 11, 16 pins
pci0 at mainbus0 bus 0: configuration mode 1
pci0: i/o space, memory space enabled, rd/line, rd/mult, wr/inv ok
pchb0 at pci0 dev 0 function 0
pchb0: ServerWorks CNB20LE Host (rev. 0x06)
pchb1 at pci0 dev 0 function 1
pchb1: ServerWorks CNB20LE Host (rev. 0x06)
pci1 at pchb1 bus 1
pci1: i/o space, memory space enabled, rd/line, rd/mult, wr/inv ok
adv1 at pci1 dev 2 function 0: AdvanSys ABP-9xxUA SCSI adapter
adv1: interrupting at ioapic1 pin 11 (irq 10)
scsibus0 at adv1: 8 targets, 8 luns per target
vga1 at pci0 dev 1 function 0: ATI Technologies Rage XL (rev. 0x27)
wsdisplay0 at vga1 kbdmux 1: console (80x25, vt100 emulation)
wsmux1: connecting to wsdisplay0
fxp0 at pci0 dev 4 function 0: i82559 Ethernet, rev 8
fxp0: interrupting at ioapic1 pin 4 (irq 9)
fxp0: Ethernet address 00:e0:81:04:0f:7e
inphy0 at fxp0 phy 1: i82555 10/100 media interface, rev. 4
inphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
fxp1 at pci0 dev 5 function 0: i82559 Ethernet, rev 8
fxp1: interrupting at ioapic1 pin 5 (irq 5)
fxp1: Ethernet address 00:e0:81:04:0f:7f
inphy1 at fxp1 phy 1: i82555 10/100 media interface, rev. 4
inphy1: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
pcib0 at pci0 dev 15 function 0
pcib0: ServerWorks OSB4 SouthBridge (rev. 0x50)
rccide0 at pci0 dev 15 function 1
rccide0: ServerWorks OSB4 IDE Controller (rev. 0x00)
rccide0: bus-master DMA support present
rccide0: primary channel configured to compatibility mode
rccide0: primary channel interrupting at ioapic0 pin 14 (irq 14)
atabus0 at rccide0 channel 0
rccide0: secondary channel configured to compatibility mode
rccide0: secondary channel interrupting at ioapic0 pin 15 (irq 15)
atabus1 at rccide0 channel 1
ohci0 at pci0 dev 15 function 2: ServerWorks OSB4/CSB5 USB Host Controller (rev. 0x04)
ohci0: interrupting at ioapic0 pin 10 (irq 10)
ohci0: OHCI version 1.0, legacy support
usb0 at ohci0: USB revision 1.0
uhub0 at usb0
uhub0: ServerWorks OHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub0: 4 ports with 4 removable, self powered
isa0 at pcib0
com0 at isa0 port 0x3f8-0x3ff irq 4: ns16550a, working fifo
com1 at isa0 port 0x2f8-0x2ff irq 3: ns16550a, working fifo
pckbc0 at isa0 port 0x60-0x64
pckbd0 at pckbc0 (kbd slot)
pckbc0: using irq 1 for kbd slot
wskbd0 at pckbd0: console keyboard, using wsdisplay0
pcppi0 at isa0 port 0x61
midi0 at pcppi0: PC speaker
sysbeep0 at pcppi0
isapnp0 at isa0 port 0x279: ISA Plug 'n Play device support
npx0 at isa0 port 0xf0-0xff: using exception 16
fdc0 at isa0 port 0x3f0-0x3f7 irq 6 drq 2
isapnp0: no ISA Plug 'n Play devices found
fd0 at fdc0 drive 0: 1.44MB, 80 cyl, 2 head, 18 sec
Kernelized RAIDframe activated
scsibus0: waiting 2 seconds for devices to settle...
wd0 at atabus0 drive 0: <Maxtor 52049H4>
wd0: drive supports 16-sector PIO transfers, LBA addressing
wd0: 19541 MB, 39703 cyl, 16 head, 63 sec, 512 bytes/sect x 40020624 sectors
wd0: 32-bit data port
wd0: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 5 (Ultra/100)
wd0(rccide0:0:0): using PIO mode 4, DMA mode 2, Ultra-DMA mode 2 (Ultra/33) (using DMA data transfers)
wd1 at atabus1 drive 0: <Maxtor 52049H3>
wd1: drive supports 16-sector PIO transfers, LBA addressing
wd1: 19541 MB, 39704 cyl, 16 head, 63 sec, 512 bytes/sect x 40021632 sectors
wd1: 32-bit data port
wd1: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 5 (Ultra/100)
wd1(rccide0:1:0): using PIO mode 4, DMA mode 2, Ultra-DMA mode 2 (Ultra/33) (using DMA data transfers)
raid0: RAID Level 1
raid0: Components: /dev/wd0a /dev/wd1a[**FAILED**]
raid0: Total Sectors: 1056256 (515 MB)
raid4: RAID Level 1
raid4: Components: /dev/wd0b /dev/wd1b[**FAILED**]
raid4: Total Sectors: 1056256 (515 MB)
raid1: RAID Level 1
raid1: Components: /dev/wd0e /dev/wd1e[**FAILED**]
raid1: Total Sectors: 8450944 (4126 MB)
raid2: RAID Level 1
raid2: Components: /dev/wd0f /dev/wd1f[**FAILED**]
raid2: Total Sectors: 15845632 (7737 MB)
raid3: RAID Level 1
raid3: Components: /dev/wd0g /dev/wd1g[**FAILED**]
raid3: Total Sectors: 12692608 (6197 MB)
boot device: raid0
root on raid0a dumps on raid0b
root file system type: ffs
cpu1: CPU 1 running
raid0: Device already configured!
raid1: Device already configured!
raid2: Device already configured!
raid3: Device already configured!
wsdisplay0: screen 1 added (80x25, vt100 emulation)
wsdisplay0: screen 2 added (80x25, vt100 emulation)
wsdisplay0: screen 3 added (80x25, vt100 emulation)
wsdisplay0: screen 4 added (80x25, vt100 emulation)
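
To pull the RAID state out of a dmesg like the one above without reading
it by eye, something along these lines might do. It is only a sketch:
summarize_raid is a made-up name, and it assumes the "Components:" and
"Total Sectors:" line formats shown above, with 512-byte sectors.

```python
import re

COMPONENTS_RE = re.compile(r"^(raid\d+): Components: (.+)$")
SECTORS_RE = re.compile(r"^(raid\d+): Total Sectors: (\d+)")
FAILED_TAG = "[**FAILED**]"


def summarize_raid(dmesg_text):
    """Map each raid device to its good/failed components and size in MB."""
    summary = {}
    for line in dmesg_text.splitlines():
        line = line.strip()
        m = COMPONENTS_RE.match(line)
        if m:
            entry = summary.setdefault(m.group(1), {"ok": [], "failed": []})
            for comp in m.group(2).split():
                if comp.endswith(FAILED_TAG):
                    # Strip the marker so only the device path is kept.
                    entry["failed"].append(comp[: -len(FAILED_TAG)])
                else:
                    entry["ok"].append(comp)
            continue
        m = SECTORS_RE.match(line)
        if m:
            entry = summary.setdefault(m.group(1), {"ok": [], "failed": []})
            # 512-byte sectors, reported in whole MB (2**20 bytes).
            entry["mb"] = int(m.group(2)) * 512 // (1024 * 1024)
    return summary


sample = """\
raid0: RAID Level 1
raid0: Components: /dev/wd0a /dev/wd1a[**FAILED**]
raid0: Total Sectors: 1056256 (515 MB)
"""
print(summarize_raid(sample))
```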
> Have you tried isolating which of the RAID sets seems to be causing
> the problem?
I'll have to try that this weekend. A few weeks ago I tried bringing the
mirrors up one at a time and left things running for a while in
between. raid0 seemed ok. But by the time I got to the last one I wasn't
sure about how long it would take for problems to come up. This is
probably going to take several days to do.
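
Roughly, the one-at-a-time plan could be written down like the sketch
below. It only prints commands and runs nothing. The array-to-partition
pairs come from the layout described earlier, and the raidctl -a / -F
steps mirror the sequence mentioned above. The exact component argument
to -F, the soak note, and the helper names are assumptions.

```python
# wd1 partition for each array, taken from the layout described earlier.
MIRRORS = [
    ("raid0", "/dev/wd1a"),
    ("raid1", "/dev/wd1e"),
    ("raid2", "/dev/wd1f"),
    ("raid3", "/dev/wd1g"),
    ("raid4", "/dev/wd1b"),
]


def plan_for(raid, component):
    """Return the commands (as strings) to re-mirror one array; nothing is run."""
    return [
        f"raidctl -a {component} {raid}",  # add the partition as a hot spare
        f"raidctl -F {component} {raid}",  # reconstruct (component name assumed)
        f"raidctl -S {raid}",              # check reconstruction progress
        f"raidctl -s {raid}",              # confirm component status afterwards
    ]


def staged_plan(mirrors, soak_note="run normally for a while; watch for segfaults"):
    """Yield (raid, commands, soak_note) one array at a time."""
    for raid, component in mirrors:
        yield raid, plan_for(raid, component), soak_note


for raid, commands, note in staged_plan(MIRRORS):
    print(f"== {raid} ==")
    for cmd in commands:
        print("  " + cmd)
    print("  # then: " + note)
```

Doing one array per stage means that if the segfaults come back, the
array most recently re-mirrored is the first suspect.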