Subject: 2 hangs in 24 hours on 1.5U system with raid5 disks...
To: None <current-users@netbsd.org>
From: Jeff Rizzo <riz@boogers.sf.ca.us>
List: current-users
Date: 09/20/2001 16:07:55
I have *no* idea if this has anything to do with the raid5 setup
I've just moved the system to, but since it's the only thing
that's changed in the last two weeks on an otherwise-stable system,
I suspect that *something* here is related.

Last night, and again this morning, my main NFS server/name server
machine, a PentiumII-350 running 1.5U, locked up.  I was able to get into
DDB, but couldn't do anything else on the console - it was hung.
When I tried to "reboot" from the db> prompt, it hung solid after
"syncing disks..." (I waited for several hours this morning) and had to
be power cycled.

About a week ago, maybe two, I moved all the partitions (including swap)
from wd0 to raid0, a 3-disk raid5 setup which spans two scsi controllers.
The raid itself had been functioning (albeit with no load) for some time
prior to that... the only thing I changed in the kernel was to put 
root on raid0.  

The logs show me nothing.  No hardware errors, no nothing prior to the hang.
Unfortunately I didn't have the console connected, so I did not see if
there were any messages.

During the second hang, I had the presence of mind to do a "ps" at the db>
prompt (see below) to see what was running.  Nothing jumps out at me,
but then, I'm not sure what I'm looking for.

One other thing - when the machine came back both times, the superblock
of my raid0g partition (only) was corrupted... I had to use an alternate.

Any suggestions on what I should be looking at to troubleshoot this?  Assuming
it happens again (and it's happened twice, so I have every expectation
it will again), what should I do at the db> prompt to gather more info
as to what's actually happening?  Are there known issues I should be
looking out for?

Thanks for any suggestions/help.

Following are the 'ps' listing from the hang and a dmesg of the box.

db> ps                                       
 PID             PPID       PGRP        UID S   FLAGS          COMMAND    WAIT
 6064            6057       6055       6004 3  0x4004            therm flt_nor
 6063            6062       6054         99 3  0x4004             perl flt_nor
 6062            6056       6054         99 3  0x4084               sh    wait
 6057            6055       6055       6004 3  0x4084               sh   netio
 6056            6054       6054         99 3  0x4084             perl    wait
 6055            6052       6055       6004 3  0x4084               sh    wait
 6054            6051       6054         99 3  0x4084               sh    wait
 6052             266        266          0 3    0x84             cron   netio
 6051             266        266          0 3    0x84             cron   netio
 5807             221        221      32767 3   0x180            httpd   netio
 5805             221        221      32767 2   0x180            httpd        
 5778             221        221      32767 2   0x180            httpd
 5777             221        221      32767 2   0x180            httpd
 5776             221        221      32767 2   0x180            httpd
 5775             221        221      32767 2   0x180            httpd
 5774             221        221      32767 2   0x180            httpd
 5594            5241       5594       6004 4  0x500b             mutt
 5241            5238       5241       6004 3  0x4082             tcsh   ttyin
 5238             253        253          0 3   0x180             sshd  select
 5233            5227       5220         90 3  0x4182           dumper   netio
 5232            5227       5220         90 3  0x4183           dumper   netio
 5231            5227       5220         90 3  0x4183           dumper   netio
 5230            5227       5220         90 3  0x4107           dumper uvn_fp1
 5229            5228       5220         90 3    0x83            taper   netio
 5228            5227       5220         90 3  0x4083            taper   netio
 5227            5220       5220         90 3  0x4083           driver  select
 5220            5214       5220         90 3  0x4082               sh    wait
 5214            5211       5214         90 3  0x4082              csh   pause
 5211            5048       5211          0 3  0x4082              csh   pause
 5048            5047       5048       6004 3  0x4082             tcsh   pause
 5047             253        253          0 2   0x180             sshd        
 4421            4420       4421       6004 3  0x4082             tcsh   ttyin
 4420             253        253          0 2   0x180             sshd        
 484                1        484          0 2  0x4082            getty
 434                1        434          0 3     0x4             ntpd flt_nor
 347              333        347          0 3  0x4082              csh   ttyin
 333              332        333       6004 3  0x4082             tcsh   pause
 332                1        269       6004 3  0x4080             rxvt  select
 302              288        269       6004 3  0x4080      FvwmIconMan  select
 301              288        269       6004 3  0x4080        FvwmPager  select
 290              288        290       6004 3     0x4        ssh-agent flt_nor
 288              283        269       6004 3  0x4080            fvwm2  select
 287              283        287       6004 3  0x4004           xclock flt_pmf
 283                1        269       6004 3  0x4080              csh   pause
 280                1        269       6004 3  0x4004             Xvnc flt_pmf
 272                1        272          0 3  0x4082            getty   ttyin
 266                1        266          0 3     0x4             cron flt_nor
 263                1        263          0 3    0x80            inetd  select
 256                1        256          0 3     0x4         sendmail uao_get
 253                1        253          0 3    0x80             sshd  select
 251                1        251       1000 3    0x80         postgres  select
 242                1        242          0 3     0x4            named flt_nor
 221                1        221          0 3     0x4            httpd biowait
 217                1        217          0 3     0x5             nmbd flt_nor
 215                1        215          0 3    0x81             smbd  select
 213                1          9          0 3    0x82            snmpd  select
 211                1        211          0 3     0x4             afpd anonget
 209                1        209          0 3    0x80             papd  select
 198                1        198          0 3  0x1004           atalkd biowait
 161                1        161          0 3    0x80        rpc.lockd  select
 159              154        154          0 3    0x84             nfsd    nfsd
 158              154        154          0 3    0x84             nfsd    nfsd
 157              154        154          0 3    0x84             nfsd    nfsd
 156              154        154          0 3    0x84             nfsd    nfsd
 154                1        154          0 3    0x80             nfsd  select
 145                1        145          0 3    0x80           mountd  select
 117                1        117          0 2    0x80          rpcbind        
 113                1        113          0 3     0x4            named anonget
 103                1        103          0 3     0x4          syslogd flt_nor
 8                  0          0          0 3 0x20204         aiodoned aiodone
 7                  0          0          0 3 0x20204          ioflush drainvp
 6                  0          0          0 3 0x20204           reaper  reaper
 5                  0          0          0 3 0x20204       pagedaemon pgdaemo
 4                  0          0          0 3 0x20204             raid km_getw
 3                  0          0          0 3 0x20204             apm0   apmev
 2                  0          0          0 3 0x20204             usb0  usbevt
 1                  0          1          0 3  0x4080             init    wait
 0                 -1          0          0 3 0x20204          swapper schedpw
db>  
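
One way to get a quick overview of a listing like the one above is to tally
the WAIT column and see which wait channels the processes are piling up on.
What follows is only a rough Python sketch, not anything that ships with
NetBSD: the function name tally_wait_channels is made up for illustration,
the sample rows are copied from the listing above, and it assumes that
command names contain no spaces.

```python
from collections import Counter


def tally_wait_channels(ps_text):
    """Count processes per wait channel in a pasted DDB 'ps' listing."""
    counts = Counter()
    for line in ps_text.splitlines():
        fields = line.split()
        # Data rows start with a numeric PID; this skips the header,
        # blank lines and the "db>" prompt.
        if len(fields) < 7 or not fields[0].isdigit():
            continue
        # Rows with an empty WAIT column have only 7 fields.
        wchan = fields[7] if len(fields) > 7 else "(none)"
        counts[wchan] += 1
    return counts


# A few rows copied from the listing above (assumed column layout:
# PID PPID PGRP UID S FLAGS COMMAND WAIT).
sample = """\
 PID             PPID       PGRP        UID S   FLAGS          COMMAND    WAIT
 5230            5227       5220         90 3  0x4107           dumper uvn_fp1
 434                1        434          0 3     0x4             ntpd flt_nor
 266                1        266          0 3     0x4             cron flt_nor
 221                1        221          0 3     0x4            httpd biowait
 117                1        117          0 2    0x80          rpcbind
 4                  0          0          0 3 0x20204             raid km_getw
"""

# Print the wait channels, most common first.
for wchan, n in tally_wait_channels(sample).most_common():
    print(f"{n:3d}  {wchan}")
```

Feeding it the whole listing instead of the six sample rows gives a count per
wait channel for every process that was present at the time of the hang.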


hubba:riz  ~> dmesg |more
NetBSD 1.5U (HUBBA) #3: Mon Sep 10 19:40:54 PDT 2001
    riz@hubba.boogers.sf.ca.us:/usr/work/netbsd/src/sys/arch/i386/compile/HUBBA
cpu0: Intel Pentium II/Celeron (Deschutes) (686-class), 350.83 MHz
cpu0: I-cache 16K 32b/line 4-way, D-cache 16K 32b/line 2/4-way
cpu0: L2 cache 512K 32b/line 4-way
cpu0: features 183fbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR>
cpu0: features 183fbff<PGE,MCA,CMOV,FGPAT,PSE36,MMX,FXSR>
total memory = 127 MB
avail memory = 114 MB
using 1659 buffers containing 6636 KB of memory
BIOS32 rev. 0 found at 0xfb3c0
mainbus0 (root)
pci0 at mainbus0 bus 0: configuration mode 1
pci0: i/o space, memory space enabled
pchb0 at pci0 dev 0 function 0
pchb0: Intel 82443BX Host Bridge/Controller (rev. 0x02)
ppb0 at pci0 dev 1 function 0: Intel 82443BX AGP Interface (rev. 0x02)
pci1 at ppb0 bus 1
pci1: i/o space, memory space enabled
vga1 at pci1 dev 0 function 0: S3 Savage3D (rev. 0x01)
wsdisplay0 at vga1
pcib0 at pci0 dev 7 function 0
pcib0: Intel 82371AB PCI-to-ISA Bridge (PIIX4) (rev. 0x02)
pciide0 at pci0 dev 7 function 1: Intel 82371AB IDE controller (PIIX4) (rev. 0x01)
pciide0: bus-master DMA support present
pciide0: primary channel wired to compatibility mode
wd0 at pciide0 channel 0 drive 0: <WDC AC420400D>
wd0: drive supports 16-sector PIO transfers, LBA addressing
wd0: 19470 MB, 16383 cyl, 16 head, 63 sec, 512 bytes/sect x 39876480 sectors
wd0: 32-bit data port
wd0: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 4 (Ultra/66)
pciide0: primary channel interrupting at irq 14
wd0(pciide0:0:0): using PIO mode 4, Ultra-DMA mode 2 (Ultra/33) (using DMA data transfers)
pciide0: secondary channel wired to compatibility mode
pciide0: disabling secondary channel (no drives)
uhci0 at pci0 dev 7 function 2: Intel 82371AB USB Host Controller (PIIX4) (rev. 0x01)
uhci0: interrupting at irq 12
usb0 at uhci0: USB revision 1.0
uhub0 at usb0
uhub0: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub0: 2 ports with 2 removable, self powered
Intel 82371AB Power Management Controller (PIIX4) (miscellaneous bridge, revision 0x02) at pci0 dev 7 function 3 not configured
ahc0 at pci0 dev 11 function 0
ahc0: interrupting at irq 10
ahc0: aic7880 Wide Channel A, SCSI Id=7, 16/255 SCBs
scsibus0 at ahc0 channel 0: 16 targets, 8 luns per target
de0 at pci0 dev 17 function 0
de0: interrupting at irq 9
de0: 21140A [10-100Mb/s] pass 2.2
de0: address 00:80:c8:7e:b6:15
de1 at pci0 dev 18 function 0
de1: interrupting at irq 5
de1: 21140A [10-100Mb/s] pass 2.2
de1: address 00:80:c8:27:06:40
adw0 at pci0 dev 20 function 0: AdvanSys ASB-3940UW-00 SCSI adapter
adw0: interrupting at irq 12
scsibus1 at adw0: 16 targets, 8 luns per target
isa0 at pcib0
com0 at isa0 port 0x3f8-0x3ff irq 4: ns16550a, working fifo
com0: console
com1 at isa0 port 0x2f8-0x2ff irq 3: ns16550a, working fifo
pckbc0 at isa0 port 0x60-0x64
lpt0 at isa0 port 0x378-0x37b irq 7
pcppi0 at isa0 port 0x61
spkr0 at pcppi0
sysbeep0 at pcppi0
npx0 at isa0 port 0xf0-0xff: using exception 16
fdc0 at isa0 port 0x3f0-0x3f7 irq 6 drq 2
fd0 at fdc0 drive 0: 1.44MB, 80 cyl, 2 head, 18 sec
apm0 at mainbus0: Power Management spec V1.2 (slowidle)
biomask fd47 netmask ff67 ttymask ffe7
scsibus0: waiting 2 seconds for devices to settle...
de0: enabling 10baseT port
ahc0: target 0 using 16bit transfers
ahc0: target 0 synchronous at 10.0MHz, offset = 0x8
ahc0: target 0 using tagged queuing
sd0 at scsibus0 target 0 lun 0: <IBM, XP31070W      !x, 81K6> SCSI2 0/direct fixed
sd0: 1074 MB, 3907 cyl, 5 head, 112 sec, 512 bytes/sect x 2199878 sectors
ahc0: target 1 using 16bit transfers
ahc0: target 1 synchronous at 20.0MHz, offset = 0x8
ahc0: target 1 using tagged queuing
sd1 at scsibus0 target 1 lun 0: <IBM, DDRS-39130D, DC1B> SCSI2 0/direct fixed
sd1: 8715 MB, 8387 cyl, 10 head, 212 sec, 512 bytes/sect x 17850000 sectors
ahc0: target 3 using 16bit transfers
ahc0: target 3 synchronous at 20.0MHz, offset = 0x8
ahc0: target 3 using tagged queuing
sd2 at scsibus0 target 3 lun 0: <IBM, DDRS-39130D, DC1B> SCSI2 0/direct fixed
sd2: 8715 MB, 8387 cyl, 10 head, 212 sec, 512 bytes/sect x 17850000 sectors
scsibus1: waiting 2 seconds for devices to settle...
adw0: target 0 using 8-bits wide, 6.7 MHz synchronous transfers
st0 at scsibus1 target 0 lun 0: <ARCHIVE, Python 04106-XXX, 7270> SCSI2 1/sequential removable
st0: density code 37, 512-byte blocks, write-enabled
de1: autosense failed: cable problem?
adw0: target 8 using 16-bits wide, 20.8 MHz synchronous transfers
sd4 at scsibus1 target 8 lun 0: <COMPAQ, BD009122C6, B016> SCSI2 0/direct fixed
sd4: 8678 MB, 5273 cyl, 20 head, 168 sec, 512 bytes/sect x 17773524 sectors
Kernelized RAIDframe activated
IPsec: Initialized Security Association Processing.
RAID autoconfigure
Configuring raid0:
RAIDFRAME: protectedSectors is 64
RAIDFRAME: Configure (RAID Level 5): total number of sectors is 35398784 (17284 MB)
RAIDFRAME(RAID Level 5): Using 20 floating recon bufs with head sep limit 10
boot device: raid0
root on raid0a dumps on raid0b
root file system type: ffs
raid0: Device already configured!
wsdisplay0: screen 0 added (80x25, vt100 emulation)
wsdisplay0: screen 1 added (80x25, vt100 emulation)
wsdisplay0: screen 2 added (80x25, vt100 emulation)
wsdisplay0: screen 3 added (80x25, vt100 emulation)
wsdisplay0: screen 4 added (80x25, vt100 emulation)
wsmux1: connecting to wsdisplay0
de1: autosense failed: cable problem?

-- 
Jeff Rizzo                                         http://boogers.sf.ca.us/~riz