Subject: 2 hangs in 24 hours on 1.5U system with raid5 disks...
To: None <current-users@netbsd.org>
From: Jeff Rizzo <riz@boogers.sf.ca.us>
List: current-users
Date: 09/20/2001 16:07:55
I have *no* idea if this has anything to do with the raid5 setup
I've just moved the system to, but since it's the only thing
that's changed in the last two weeks on an otherwise-stable system,
I suspect that *something* here is related.
Last night, and again this morning, my main NFS server/name server
machine, a PentiumII-350 running 1.5U, locked up. I was able to get into
DDB, but couldn't do anything else on the console - it was hung.
When I tried to "reboot" from the db> prompt, it hung solid after
"syncing disks..." (I waited for several hours this morning) and had to
be power cycled.
About a week ago, maybe two, I moved all the partitions (including swap)
from wd0 to raid0, a 3-disk raid5 setup which spans two SCSI controllers.
The raid itself had been functioning (albeit with no load) for some time
prior to that... the only thing I changed in the kernel was to put
root on raid0.
The logs show me nothing. No hardware errors, no nothing prior to the hang.
Unfortunately I didn't have the console connected, so I did not see if
there were any messages.
During the second hang, I had the presence of mind to do a "ps" at the db>
prompt (see below) to see what was running. Nothing jumps out at me,
but then, I'm not sure what I'm looking for.
One other thing - when the machine came back both times, the superblock
of my raid0g partition (only) was corrupted... I had to use an alternate.
Any suggestions what I should be looking at to troubleshoot this? Assuming
it happens again (and it's happened twice, so I have every expectation
it will again), what should I do at the db> prompt to gather more info
as to what's actually happening? Are there known issues I should be
looking out for?
Thanks for any suggestions/help.
Following are the 'ps' listing from the hang and a dmesg of the box.
db> ps
PID PPID PGRP UID S FLAGS COMMAND WAIT
6064 6057 6055 6004 3 0x4004 therm flt_nor
6063 6062 6054 99 3 0x4004 perl flt_nor
6062 6056 6054 99 3 0x4084 sh wait
6057 6055 6055 6004 3 0x4084 sh netio
6056 6054 6054 99 3 0x4084 perl wait
6055 6052 6055 6004 3 0x4084 sh wait
6054 6051 6054 99 3 0x4084 sh wait
6052 266 266 0 3 0x84 cron netio
6051 266 266 0 3 0x84 cron netio
5807 221 221 32767 3 0x180 httpd netio
5805 221 221 32767 2 0x180 httpd
5778 221 221 32767 2 0x180 httpd
5777 221 221 32767 2 0x180 httpd
5776 221 221 32767 2 0x180 httpd
5775 221 221 32767 2 0x180 httpd
5774 221 221 32767 2 0x180 httpd
5594 5241 5594 6004 4 0x500b mutt
5241 5238 5241 6004 3 0x4082 tcsh ttyin
5238 253 253 0 3 0x180 sshd select
5233 5227 5220 90 3 0x4182 dumper netio
5232 5227 5220 90 3 0x4183 dumper netio
5231 5227 5220 90 3 0x4183 dumper netio
5230 5227 5220 90 3 0x4107 dumper uvn_fp1
5229 5228 5220 90 3 0x83 taper netio
5228 5227 5220 90 3 0x4083 taper netio
5227 5220 5220 90 3 0x4083 driver select
5220 5214 5220 90 3 0x4082 sh wait
5214 5211 5214 90 3 0x4082 csh pause
5211 5048 5211 0 3 0x4082 csh pause
5048 5047 5048 6004 3 0x4082 tcsh pause
5047 253 253 0 2 0x180 sshd
4421 4420 4421 6004 3 0x4082 tcsh ttyin
4420 253 253 0 2 0x180 sshd
484 1 484 0 2 0x4082 getty
434 1 434 0 3 0x4 ntpd flt_nor
347 333 347 0 3 0x4082 csh ttyin
333 332 333 6004 3 0x4082 tcsh pause
332 1 269 6004 3 0x4080 rxvt select
302 288 269 6004 3 0x4080 FvwmIconMan select
301 288 269 6004 3 0x4080 FvwmPager select
290 288 290 6004 3 0x4 ssh-agent flt_nor
288 283 269 6004 3 0x4080 fvwm2 select
287 283 287 6004 3 0x4004 xclock flt_pmf
283 1 269 6004 3 0x4080 csh pause
280 1 269 6004 3 0x4004 Xvnc flt_pmf
272 1 272 0 3 0x4082 getty ttyin
266 1 266 0 3 0x4 cron flt_nor
263 1 263 0 3 0x80 inetd select
256 1 256 0 3 0x4 sendmail uao_get
253 1 253 0 3 0x80 sshd select
251 1 251 1000 3 0x80 postgres select
242 1 242 0 3 0x4 named flt_nor
221 1 221 0 3 0x4 httpd biowait
217 1 217 0 3 0x5 nmbd flt_nor
215 1 215 0 3 0x81 smbd select
213 1 9 0 3 0x82 snmpd select
211 1 211 0 3 0x4 afpd anonget
209 1 209 0 3 0x80 papd select
198 1 198 0 3 0x1004 atalkd biowait
161 1 161 0 3 0x80 rpc.lockd select
159 154 154 0 3 0x84 nfsd nfsd
158 154 154 0 3 0x84 nfsd nfsd
157 154 154 0 3 0x84 nfsd nfsd
156 154 154 0 3 0x84 nfsd nfsd
154 1 154 0 3 0x80 nfsd select
145 1 145 0 3 0x80 mountd select
117 1 117 0 2 0x80 rpcbind
113 1 113 0 3 0x4 named anonget
103 1 103 0 3 0x4 syslogd flt_nor
8 0 0 0 3 0x20204 aiodoned aiodone
7 0 0 0 3 0x20204 ioflush drainvp
6 0 0 0 3 0x20204 reaper reaper
5 0 0 0 3 0x20204 pagedaemon pgdaemo
4 0 0 0 3 0x20204 raid km_getw
3 0 0 0 3 0x20204 apm0 apmev
2 0 0 0 3 0x20204 usb0 usbevt
1 0 1 0 3 0x4080 init wait
0 -1 0 0 3 0x20204 swapper schedpw
db>
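Since nothing jumps out of the listing by eye, one rough way to summarize it is to count how many processes sit on each wait channel. The following is only a minimal sketch in Python, not a diagnostic tool: the function name `tally_wait_channels` and the short `SAMPLE` excerpt are illustrative assumptions, and it assumes the eight-column layout shown above, where the WAIT column is simply missing on some rows.

```python
from collections import Counter

# A few rows copied from the listing above, for illustration only.
SAMPLE = """\
PID PPID PGRP UID S FLAGS COMMAND WAIT
6064 6057 6055 6004 3 0x4004 therm flt_nor
6063 6062 6054 99 3 0x4004 perl flt_nor
5805 221 221 32767 2 0x180 httpd
221 1 221 0 3 0x4 httpd biowait
4 0 0 0 3 0x20204 raid km_getw
"""

def tally_wait_channels(ps_text):
    """Count processes per WAIT channel in a pasted DDB 'ps' listing."""
    counts = Counter()
    for line in ps_text.splitlines():
        fields = line.split()
        # Skip the header, the db> prompt, and anything else that does
        # not start with a numeric PID followed by at least 7 columns.
        if len(fields) < 7 or not fields[0].isdigit():
            continue
        # Rows with no WAIT entry have only 7 columns.
        wait = fields[7] if len(fields) > 7 else "(none)"
        counts[wait] += 1
    return counts

if __name__ == "__main__":
    for channel, n in tally_wait_channels(SAMPLE).most_common():
        print(f"{n:3d}  {channel}")
```

Pasting the full listing in place of `SAMPLE` would show at a glance how many processes are sitting on each channel (for example `flt_nor`, `biowait`, or `netio`), which may or may not point at anything useful.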
hubba:riz ~> dmesg |more
NetBSD 1.5U (HUBBA) #3: Mon Sep 10 19:40:54 PDT 2001
riz@hubba.boogers.sf.ca.us:/usr/work/netbsd/src/sys/arch/i386/compile/HUBBA
cpu0: Intel Pentium II/Celeron (Deschutes) (686-class), 350.83 MHz
cpu0: I-cache 16K 32b/line 4-way, D-cache 16K 32b/line 2/4-way
cpu0: L2 cache 512K 32b/line 4-way
cpu0: features 183fbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR>
cpu0: features 183fbff<PGE,MCA,CMOV,FGPAT,PSE36,MMX,FXSR>
total memory = 127 MB
avail memory = 114 MB
using 1659 buffers containing 6636 KB of memory
BIOS32 rev. 0 found at 0xfb3c0
mainbus0 (root)
pci0 at mainbus0 bus 0: configuration mode 1
pci0: i/o space, memory space enabled
pchb0 at pci0 dev 0 function 0
pchb0: Intel 82443BX Host Bridge/Controller (rev. 0x02)
ppb0 at pci0 dev 1 function 0: Intel 82443BX AGP Interface (rev. 0x02)
pci1 at ppb0 bus 1
pci1: i/o space, memory space enabled
vga1 at pci1 dev 0 function 0: S3 Savage3D (rev. 0x01)
wsdisplay0 at vga1
pcib0 at pci0 dev 7 function 0
pcib0: Intel 82371AB PCI-to-ISA Bridge (PIIX4) (rev. 0x02)
pciide0 at pci0 dev 7 function 1: Intel 82371AB IDE controller (PIIX4) (rev. 0x01)
pciide0: bus-master DMA support present
pciide0: primary channel wired to compatibility mode
wd0 at pciide0 channel 0 drive 0: <WDC AC420400D>
wd0: drive supports 16-sector PIO transfers, LBA addressing
wd0: 19470 MB, 16383 cyl, 16 head, 63 sec, 512 bytes/sect x 39876480 sectors
wd0: 32-bit data port
wd0: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 4 (Ultra/66)
pciide0: primary channel interrupting at irq 14
wd0(pciide0:0:0): using PIO mode 4, Ultra-DMA mode 2 (Ultra/33) (using DMA data transfers)
pciide0: secondary channel wired to compatibility mode
pciide0: disabling secondary channel (no drives)
uhci0 at pci0 dev 7 function 2: Intel 82371AB USB Host Controller (PIIX4) (rev. 0x01)
uhci0: interrupting at irq 12
usb0 at uhci0: USB revision 1.0
uhub0 at usb0
uhub0: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub0: 2 ports with 2 removable, self powered
Intel 82371AB Power Management Controller (PIIX4) (miscellaneous bridge, revision 0x02) at pci0 dev 7 function 3 not configured
ahc0 at pci0 dev 11 function 0
ahc0: interrupting at irq 10
ahc0: aic7880 Wide Channel A, SCSI Id=7, 16/255 SCBs
scsibus0 at ahc0 channel 0: 16 targets, 8 luns per target
de0 at pci0 dev 17 function 0
de0: interrupting at irq 9
de0: 21140A [10-100Mb/s] pass 2.2
de0: address 00:80:c8:7e:b6:15
de1 at pci0 dev 18 function 0
de1: interrupting at irq 5
de1: 21140A [10-100Mb/s] pass 2.2
de1: address 00:80:c8:27:06:40
adw0 at pci0 dev 20 function 0: AdvanSys ASB-3940UW-00 SCSI adapter
adw0: interrupting at irq 12
scsibus1 at adw0: 16 targets, 8 luns per target
isa0 at pcib0
com0 at isa0 port 0x3f8-0x3ff irq 4: ns16550a, working fifo
com0: console
com1 at isa0 port 0x2f8-0x2ff irq 3: ns16550a, working fifo
pckbc0 at isa0 port 0x60-0x64
lpt0 at isa0 port 0x378-0x37b irq 7
pcppi0 at isa0 port 0x61
spkr0 at pcppi0
sysbeep0 at pcppi0
npx0 at isa0 port 0xf0-0xff: using exception 16
fdc0 at isa0 port 0x3f0-0x3f7 irq 6 drq 2
fd0 at fdc0 drive 0: 1.44MB, 80 cyl, 2 head, 18 sec
apm0 at mainbus0: Power Management spec V1.2 (slowidle)
biomask fd47 netmask ff67 ttymask ffe7
scsibus0: waiting 2 seconds for devices to settle...
de0: enabling 10baseT port
ahc0: target 0 using 16bit transfers
ahc0: target 0 synchronous at 10.0MHz, offset = 0x8
ahc0: target 0 using tagged queuing
sd0 at scsibus0 target 0 lun 0: <IBM, XP31070W !x, 81K6> SCSI2 0/direct fixed
sd0: 1074 MB, 3907 cyl, 5 head, 112 sec, 512 bytes/sect x 2199878 sectors
ahc0: target 1 using 16bit transfers
ahc0: target 1 synchronous at 20.0MHz, offset = 0x8
ahc0: target 1 using tagged queuing
sd1 at scsibus0 target 1 lun 0: <IBM, DDRS-39130D, DC1B> SCSI2 0/direct fixed
sd1: 8715 MB, 8387 cyl, 10 head, 212 sec, 512 bytes/sect x 17850000 sectors
ahc0: target 3 using 16bit transfers
ahc0: target 3 synchronous at 20.0MHz, offset = 0x8
ahc0: target 3 using tagged queuing
sd2 at scsibus0 target 3 lun 0: <IBM, DDRS-39130D, DC1B> SCSI2 0/direct fixed
sd2: 8715 MB, 8387 cyl, 10 head, 212 sec, 512 bytes/sect x 17850000 sectors
scsibus1: waiting 2 seconds for devices to settle...
adw0: target 0 using 8-bits wide, 6.7 MHz synchronous transfers
st0 at scsibus1 target 0 lun 0: <ARCHIVE, Python 04106-XXX, 7270> SCSI2 1/sequential removable
st0: density code 37, 512-byte blocks, write-enabled
de1: autosense failed: cable problem?
adw0: target 8 using 16-bits wide, 20.8 MHz synchronous transfers
sd4 at scsibus1 target 8 lun 0: <COMPAQ, BD009122C6, B016> SCSI2 0/direct fixed
sd4: 8678 MB, 5273 cyl, 20 head, 168 sec, 512 bytes/sect x 17773524 sectors
Kernelized RAIDframe activated
IPsec: Initialized Security Association Processing.
RAID autoconfigure
Configuring raid0:
RAIDFRAME: protectedSectors is 64
RAIDFRAME: Configure (RAID Level 5): total number of sectors is 35398784 (17284 MB)
RAIDFRAME(RAID Level 5): Using 20 floating recon bufs with head sep limit 10
boot device: raid0
root on raid0a dumps on raid0b
root file system type: ffs
raid0: Device already configured!
wsdisplay0: screen 0 added (80x25, vt100 emulation)
wsdisplay0: screen 1 added (80x25, vt100 emulation)
wsdisplay0: screen 2 added (80x25, vt100 emulation)
wsdisplay0: screen 3 added (80x25, vt100 emulation)
wsdisplay0: screen 4 added (80x25, vt100 emulation)
wsmux1: connecting to wsdisplay0
de1: autosense failed: cable problem?
--
Jeff Rizzo http://boogers.sf.ca.us/~riz