Subject: Re: Possible serious bug in NetBSD-1.6.1_RC2
To: Greg Oster <oster@cs.usask.ca>
From: Brian Buhrow <buhrow@lothlorien.nfbcal.org>
List: current-users
Date: 03/14/2003 07:35:46
	Hello Greg.   Well, we've had two hangs in 48 hours, and the second
one wasn't due to a lot of files coming and going in combination with the
parity checker. There's something in the daily routine of this machine
which is triggering the problem, but I'm not sure what it is.
We're paging to wd1b, which is a stand-alone partition on one of the hard
drives which is part of the raid. There are no hard drives in this machine
which are not part of the raid set.  Dmesg output is below, as are a copy
of the /etc/raid0.conf file and the output of disklabel(8) from the raid
set.  Each of the disks in the raid set has an a and b partition in
addition to the raid partition.  Each a partition contains a basic NetBSD
installation, and each b partition is reserved for swap, should the need to
boot off a given NetBSD installation arise.  The idea here is not that the
machine is completely redundant, but that given any single disk failure, we
can reboot and recover, even if it means we have to physically recable
disks to do it.
	If you're struck by any thoughts about this setup, let me know.  And,
of course, if you have any thoughts about what might be causing this
problem in the first place, I'm extremely interested.

-thanks
-Brian

#Raid Configuration File for lothlorien.nfbcal.org (BB 11/19/2002)
#Brian Buhrow
#Describe the size of the array, including spares
START array
#numrow numcol numspare
1 3 1

#Disk section
START disks
/dev/wd0e
/dev/wd1e
/dev/wd2e

#Layout section.  We'll use 64 sectors per stripe unit, 1 stripe unit per
#parity unit, 1 stripe unit per reconstruction unit, and raid level 5.
START layout
#SectperSu SusperParityUnit SusperReconUnit Raid_level
64 1 1 5

#Fifo section.  We'll use 100 outstanding requests as a start.
START queue
fifo 100

#spare section
#We're designating /dev/wd3e as the hot spare.
START spare
/dev/wd3e


<dmesg output>
NetBSD 1.6.1_RC2 (NFBNETBSD) #0: Tue Mar 11 16:02:17 PST 2003
    buhrow@lothlorien.nfbcal.org:/usr/local/netbsd/src/sys/arch/i386/compile/NFBNETBSD
cpu0: Intel Pentium III (Coppermine) (686-class), 756.83 MHz
cpu0: I-cache 16 KB 32b/line 4-way, D-cache 16 KB 32b/line 2-way
cpu0: L2 cache 256 KB 32b/line 8-way
cpu0: features 383f9ff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,SEP,MTRR>
cpu0: features 383f9ff<PGE,MCA,CMOV,FGPAT,PSE36,MMX>
cpu0: features 383f9ff<FXSR,SSE>
total memory = 126 MB
avail memory = 112 MB
using 1646 buffers containing 6584 KB of memory
BIOS32 rev. 0 found at 0xf06b0
PCI BIOS rev. 2.1 found at 0xf08b0
PCI IRQ Routing Table rev. 1.0 found at 0xf0e70, size 208 bytes (11 entries)
PCI Interrupt Router at 000:31:0 (Intel 82371FB PCI-to-ISA Bridge (PIIX))
mainbus0 (root)
pci0 at mainbus0 bus 0: configuration mode 1
pci0: i/o space, memory space enabled, rd/line, rd/mult, wr/inv ok
pchb0 at pci0 dev 0 function 0
pchb0: Intel 82810E Memory Controller Hub (rev. 0x03)
pchb0: random number generator enabled
agp0 at pchb0: aperture at 0xe4000000, size 0x4000000
vga1 at pci0 dev 1 function 0: Intel 82810E Graphics Controller (rev. 0x03)
wsdisplay0 at vga1 kbdmux 1
wsmux1: connecting to wsdisplay0
ppb0 at pci0 dev 30 function 0: Intel 82801AA Hub-to-PCI Bridge (rev. 0x02)
pci1 at ppb0 bus 1
pci1: i/o space, memory space enabled
clcs0 at pci1 dev 5 function 0: Cirrus Logic CS4280 CrystalClear Audio Interface (rev. 0x01)
clcs0: interrupting at irq 10
clcs0: CRY20 codec; headphone, 20 bit DAC, 18 bit ADC, Crystal Semi 3D
audio0 at clcs0: full duplex, independent
midi0 at clcs0: CS4280 MIDI UART
pciide1 at pci1 dev 8 function 0: Promise Ultra133/ATA Bus Master IDE Accelerator (rev. 0x02)
pciide1: bus-master DMA support present
pciide1: primary channel configured to native-PCI mode
pciide1: using irq 11 for native-PCI interrupt
wd1 at pciide1 channel 0 drive 0: <ST380021A>
wd1: drive supports 16-sector PIO transfers, LBA addressing
wd1: 76319 MB, 16383 cyl, 16 head, 63 sec, 512 bytes/sect x 156301488 sectors
wd1: 32-bit data port
wd1: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 5 (Ultra/100)
wd1(pciide1:0:0): using PIO mode 4, Ultra-DMA mode 5 (Ultra/100) (using DMA data transfers)
pciide1: secondary channel configured to native-PCI mode
wd2 at pciide1 channel 1 drive 0: <ST380021A>
wd2: drive supports 16-sector PIO transfers, LBA addressing
wd2: 76319 MB, 16383 cyl, 16 head, 63 sec, 512 bytes/sect x 156301488 sectors
wd2: 32-bit data port
wd2: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 5 (Ultra/100)
wd3 at pciide1 channel 1 drive 1: <ST380021A>
wd3: drive supports 16-sector PIO transfers, LBA addressing
wd3: 76319 MB, 16383 cyl, 16 head, 63 sec, 512 bytes/sect x 156301488 sectors
wd3: 32-bit data port
wd3: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 5 (Ultra/100)
wd2(pciide1:1:0): using PIO mode 4, Ultra-DMA mode 5 (Ultra/100) (using DMA data transfers)
wd3(pciide1:1:1): using PIO mode 4, Ultra-DMA mode 5 (Ultra/100) (using DMA data transfers)
tlp0 at pci1 dev 10 function 0: Lite-On 82C169 Ethernet, pass 2.0
tlp0: interrupting at irq 12
tlp0: Ethernet address 00:a0:cc:60:b6:fd
bmtphy0 at tlp0 phy 1: BCM5201 10/100 media interface, rev. 2
bmtphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
pcib0 at pci0 dev 31 function 0
pcib0: Intel 82801AA LPC Interface Bridge (rev. 0x02)
pciide0 at pci0 dev 31 function 1: Intel 82801AA IDE Controller (ICH) (rev. 0x02)
pciide0: bus-master DMA support present
pciide0: primary channel wired to compatibility mode
wd0 at pciide0 channel 0 drive 0: <ST380021A>
wd0: drive supports 16-sector PIO transfers, LBA addressing
wd0: 76319 MB, 16383 cyl, 16 head, 63 sec, 512 bytes/sect x 156301488 sectors
wd0: 32-bit data port
wd0: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 5 (Ultra/100)
pciide0: primary channel interrupting at irq 14
wd0(pciide0:0:0): using PIO mode 4, Ultra-DMA mode 4 (Ultra/66) (using DMA data transfers)
pciide0: secondary channel wired to compatibility mode
atapibus0 at pciide0 channel 1: 2 targets
cd0 at atapibus0 drive 0: <FX4820T, , D03D> type 5 cdrom removable
cd0: 32-bit data port
cd0: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 2 (Ultra/33)
pciide0: secondary channel interrupting at irq 15
cd0(pciide0:1:0): using PIO mode 4, Ultra-DMA mode 2 (Ultra/33) (using DMA data transfers)
uhci0 at pci0 dev 31 function 2: Intel 82801AA USB Controller (rev. 0x02)
uhci0: interrupting at irq 9
usb0 at uhci0: USB revision 1.0
uhub0 at usb0
uhub0: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub0: 2 ports with 2 removable, self powered
Intel 82801AA SMBus Controller (SMBus serial bus, revision 0x02) at pci0 dev 31 function 3 not configured
isa0 at pcib0
com0 at isa0 port 0x3f8-0x3ff irq 4: ns16550a, working fifo
com0: console
com1 at isa0 port 0x2f8-0x2ff irq 3: ns16550a, working fifo
pckbc0 at isa0 port 0x60-0x64
lpt0 at isa0 port 0x378-0x37b irq 7
pcppi0 at isa0 port 0x61
midi1 at pcppi0: PC speaker
spkr0 at pcppi0
sysbeep0 at pcppi0
isapnp0 at isa0 port 0x279: ISA Plug 'n Play device support
npx0 at isa0 port 0xf0-0xff: using exception 16
fdc0 at isa0 port 0x3f0-0x3f7 irq 6 drq 2
fd0 at fdc0 drive 0: 1.44MB, 80 cyl, 2 head, 18 sec
isapnp0: no ISA Plug 'n Play devices found
apm0 at mainbus0: Power Management spec V1.2
APM power mgmt engage (device 1): power management disabled (0x10f)
biomask eb67 netmask fb67 ttymask fbe7
Kernelized RAIDframe activated
RAID autoconfigure
Configuring raid0:
RAIDFRAME: protectedSectors is 64
RAIDFRAME: Configure (RAID Level 5): total number of sectors is 304213760 (148541 MB)
RAIDFRAME(RAID Level 5): Using 20 floating recon bufs with head sep limit 10
boot device: raid0
root on raid0a dumps on wd0b
root file system type: ffs
raid0: Device already configured!
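
As a cross-check of the RAIDframe size line above, here is a small sketch of the arithmetic.  It only uses numbers from the raid0.conf file and the dmesg output; the per-component figure is derived on the assumption that RAID 5 spends one column's worth of space on parity, and is not something RAIDframe printed.

```python
# Cross-check of the "total number of sectors" line in the dmesg output.
SECTOR_BYTES = 512          # bytes/sect from the wd* probe lines
TOTAL_SECTORS = 304213760   # reported by RAIDframe for raid0
NUM_COLUMNS = 3             # numcol in raid0.conf
SECTORS_PER_SU = 64         # SectperSu in raid0.conf

# RAID 5 uses the equivalent of one column for parity, so the usable
# space is spread over (numcol - 1) columns.
data_columns = NUM_COLUMNS - 1

# Derived value (an assumption of this sketch, not logged by the kernel):
# sectors of each component that contribute to the array.
per_component = TOTAL_SECTORS // data_columns

# Size in MB, truncated the same way the kernel message appears to be.
size_mb = TOTAL_SECTORS * SECTOR_BYTES // (1024 * 1024)

# Data sectors in one full stripe across the data columns.
full_stripe_sectors = data_columns * SECTORS_PER_SU

print(per_component, size_mb, full_stripe_sectors)
```

The computed size agrees with the 148541 MB in the dmesg line, and the full-stripe width of 128 sectors is the same figure disklabel(8) shows as sectors/track for raid0.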

<output of disklabel(8) raid0>

# /dev/rraid0d:
type: RAID
disk: raid
label: default label
flags:
bytes/sector: 512
sectors/track: 128
tracks/cylinder: 12
sectors/cylinder: 1536
cylinders: 198055
total sectors: 304213760
rpm: 3600
interleave: 1
trackskew: 0
cylinderskew: 0
headswitch: 0		# microseconds
track-to-track seek: 0	# microseconds
drivedata: 0 

9 partitions:
#        size    offset     fstype  [fsize bsize cpg/sgs]
 a:  10752000         0     4.2BSD   1024  8192    58   # (Cyl.    0 - 6999)
 b:   4608000  10752000       swap                      # (Cyl. 7000 - 9999)
 c: 304213760         0     unused      0     0         # (Cyl.    0 - 198055*)
 d: 304213760         0     4.2BSD      0     0     0   # (Cyl.    0 - 198055*)
 e:  21504000  15360000     4.2BSD   4096 16384   387   # (Cyl. 10000 - 23999)
 f:  21504000  36864000     4.2BSD   4096 16384   387   # (Cyl. 24000 - 37999)
 g:  21504000  58368000     4.2BSD   4096 16384   387   # (Cyl. 38000 - 51999)
 h:  84480000  79872000     4.2BSD   4096 16384   385   # (Cyl. 52000 - 106999)
 i: 139861760 164352000     4.2BSD   4096 16384   385   # (Cyl. 107000 - 198055*)
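
In case it helps rule out a labelling mistake, the following is a minimal sketch that checks the partitions above for gaps or overlaps.  The sizes and offsets are copied from the disklabel output; the function name check_layout is just something made up for this illustration, and the c and d partitions are left out because they cover the whole device.

```python
SECTORS_PER_CYL = 1536      # sectors/cylinder from the label
TOTAL_SECTORS = 304213760   # total sectors from the label

# (name, size, offset) copied from the disklabel output; c and d skipped.
partitions = [
    ("a", 10752000, 0),
    ("b", 4608000, 10752000),
    ("e", 21504000, 15360000),
    ("f", 21504000, 36864000),
    ("g", 21504000, 58368000),
    ("h", 84480000, 79872000),
    ("i", 139861760, 164352000),
]

def check_layout(parts, total):
    """Return a list of problems (gap, overlap, wrong end); empty if none."""
    problems = []
    expected = 0
    for name, size, offset in sorted(parts, key=lambda p: p[2]):
        if offset != expected:
            kind = "gap" if offset > expected else "overlap"
            problems.append(
                f"{kind} before {name}: expected offset {expected}, got {offset}"
            )
        expected = offset + size
    if expected != total:
        problems.append(f"last partition ends at {expected}, device has {total}")
    return problems

problems = check_layout(partitions, TOTAL_SECTORS)

# Starting cylinder of each partition, for comparison with the comments
# in the right-hand column of the label.
start_cyls = {name: offset // SECTORS_PER_CYL for name, _, offset in partitions}

print(problems or "layout is contiguous")
print(start_cyls)
```

Run against the numbers above, this reports no gaps or overlaps, and the starting cylinders match the ones in the label comments, so the label itself looks consistent.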