NetBSD-Bugs archive
kern/46136: processes get stuck in D under high I/O load
>Number: 46136
>Category: kern
>Synopsis: processes get stuck in D under high I/O load
>Confidential: no
>Severity: critical
>Priority: high
>Responsible: kern-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Sat Mar 03 15:10:00 +0000 2012
>Originator: Hauke Fath
>Release: NetBSD 6.0_BETA
>Organization:
TU Darmstadt
>Environment:
System: NetBSD venediger 6.0_BETA NetBSD 6.0_BETA (VENEDIGER) #0: Thu Mar 1
18:10:56 CET 2012
hf@Hochstuhl:/var/obj/netbsd-builds/6/i386/sys/arch/i386/compile/VENEDIGER i386
Architecture: i386
Machine: i386
>Description:
We run an i386 machine equipped with a Super Micro X7SBE (4
core Xeon) and a SCSI MegaRAID 320-4X as file server - mainly
NFS.
When we switched the RAID controller from a 320-2 to said
320-4X under netbsd-5, nfsd developed a tendency to get
stuck in 'D' state every other day, making a reboot necessary.
After upgrading to netbsd-6 and tuning buffer and pool sizes,
the nfsd problem is somewhat mitigated, although there is
still a string-and-duct-tape script in place which checks
whether nfsd has been stuck in 'D' for an extended period of
time and, if so, reboots the machine.
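For illustration only, a watchdog of that kind could look roughly
like the sketch below. This is not the script we actually run; the
function names, the 600 second threshold, the polling interval and
the exact ps(1) keywords (pid,state,ucomm) are assumptions, and
sampling every few seconds only approximates "continuously in D".

```python
import subprocess
import time


def stuck_pids(ps_output, name="nfsd"):
    """Return PIDs of processes called `name` whose state starts with 'D'.

    Expects three whitespace-separated columns: pid, state, command name.
    """
    pids = set()
    for line in ps_output.splitlines():
        fields = line.split(None, 2)
        # Skip the header and any short or malformed line.
        if len(fields) < 3 or not fields[0].isdigit():
            continue
        pid, state, comm = fields
        if comm.strip() == name and state.startswith("D"):
            pids.add(int(pid))
    return pids


def watch(threshold=600, interval=30, name="nfsd"):
    """Reboot once any `name` process was seen in 'D' for `threshold` seconds.

    Hypothetical helper: threshold and interval are arbitrary example values.
    """
    first_seen = {}  # pid -> monotonic time it was first observed in 'D'
    while True:
        out = subprocess.run(
            ["ps", "-ax", "-o", "pid,state,ucomm"],
            capture_output=True, text=True, check=True,
        ).stdout
        now = time.monotonic()
        # Keep timestamps only for PIDs that are still in 'D' in this sample.
        first_seen = {pid: first_seen.get(pid, now)
                      for pid in stuck_pids(out, name)}
        if any(now - t >= threshold for t in first_seen.values()):
            subprocess.run(["shutdown", "-r", "now"])
            return
        time.sleep(interval)
```

Calling watch() as root from an rc script would start the loop;
stuck_pids() can be exercised on its own against captured ps output.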
Unfortunately, the jobs started from /etc/daily get stuck,
too, and wedge the machine such that even a 'reboot 0x04' from
the debugger does not work, and a hard reset is needed.
From the debugger 'ps' output:
[...]
About to run shutdown hooks...
Stopping cron.
Waiting for PIDS: 826.
Stopping inetd.
Waiting for PIDS: 302.
Saved entropy to disk.
Turning off accounting.
Removing block-type swap devices
swapctl: removing /dev/ld0b as swap device
Sat Mar 3 10:50:53 CET 2012
Done running shutdown hooks.
Mar 3 10:50:59 venediger syslogd[184]: Exiting on signal 15
syncing disks... 3 done
[-- break #0(1) sent -- `\z' -- Sat Mar 3 10:53:16 2012]
fatal breakpoint trap in supervisor mode
trap type 1 code 0 eip c0183c64 cs 8 eflags 200286 cr2 bb688b04 ilevel 8
Stopped in pid 0.7 (system) at netbsd:breakpoint+0x4: popl %ebp
db{0}> ps
PID LID S CPU FLAGS STRUCT LWP * NAME WAIT
9408 1 3 0 9020000 c5ded800 amd tstile
16127 1 3 3 9020000 c5dedd40 amd tstile
545 1 3 1 9020000 c80a5000 amd tstile
17941 1 3 2 0 cc47bd40 reboot tstile
29808 1 3 3 9020000 cd494d20 find vmem
28944 1 3 0 9020000 c8a86560 find vmem
1 1 3 3 8020080 c5d78aa0 init wait
0 78 3 3 200 c538e020 nfsio nfsiod
0 77 3 2 200 c538e2c0 nfsio nfsiod
0 76 3 1 200 c538e560 nfsio nfsiod
0 75 3 2 200 c5ded560 nfsio nfsiod
0 74 5 3 200 c5e34000 (zombie)
0 73 3 3 200 c5ded020 physiod physiod
0 72 3 3 200 c5dc5d20 aiodoned aiodoned
0 71 3 2 200 c5d782c0 ioflush vmem
0 70 3 1 200 c5d78020 pgdaemon xclocv
0 67 3 3 200 c5d3b800 cryptoret crypto_w
0 66 3 3 200 c5d78560 atapibus0 sccomp
0 64 3 2 200 c5d25540 usb4 usbevt
0 63 3 0 200 c5d3b2c0 usb7 usbevt
0 62 3 3 200 c5d3b560 usb6 usbevt
0 61 3 1 200 c5d3baa0 usb5 usbevt
0 60 3 3 200 c5d78800 usb3 usbevt
0 59 3 3 200 c5d252a0 unpgc unpgc
0 58 3 0 200 c5d3bd40 usb0 usbevt
0 57 3 0 200 c5d25000 usb2 usbevt
0 56 3 2 200 c5d78d40 usbtask-dr usbtsk
0 55 3 3 200 c5d3c000 usbtask-hc usbtsk
0 54 3 3 200 c5d3c2a0 usb1 usbevt
0 53 3 0 200 c5d3c540 vmem_rehash vmem_rehash
0 52 3 0 200 c5d3c7e0 coretemp3 coretemp3
0 51 3 3 200 c5d3ca80 coretemp2 coretemp2
0 50 3 1 200 c5d3cd20 coretemp1 coretemp1
0 49 3 2 200 c5d3b020 coretemp0 coretemp0
0 40 3 2 200 c5d257e0 atabus3 atath
0 39 3 0 200 c5d25a80 atabus2 atath
0 38 3 3 200 c5d25d20 iic0 iicintr
0 37 3 2 200 c5b29020 atabus1 atath
0 36 3 0 200 c5b292c0 atabus0 atath
0 35 3 0 200 c5b29560 apm0 apmev
0 34 3 3 200 c5b29800 xcall/3 xcall
0 33 1 3 200 c5b29aa0 softser/3
0 32 1 3 200 c5b29d40 softclk/3
0 31 1 3 200 c5b1e000 softbio/3
0 30 1 3 200 c5b1e2a0 softnet/3
0 > 29 7 3 201 c5b1e540 idle/3
0 28 3 2 200 c5b1e7e0 xcall/2 xcall
0 27 1 2 200 c5b1ea80 softser/2
0 26 1 2 200 c5b1ed20 softclk/2
0 25 1 2 200 c5b1a020 softbio/2
0 24 1 2 200 c5b1a2c0 softnet/2
0 > 23 7 2 201 c5b1a560 idle/2
0 22 3 1 200 c5b1a800 xcall/1 xcall
0 21 1 1 200 c5b1aaa0 softser/1
0 20 1 1 200 c5b1ad40 softclk/1
0 19 1 1 200 c4ffb000 softbio/1
0 18 1 1 200 c4ffb2a0 softnet/1
0 > 17 7 1 201 c4ffb540 idle/1
0 16 3 0 200 c4ffb7e0 sysmon smtaskq
0 15 3 0 200 c4ffba80 pmfsuspend pmfsuspend
0 14 3 0 200 c4ffbd20 pmfevent pmfevent
0 13 3 3 200 c4ff5020 sopendfree sopendfr
0 12 3 0 200 c4ff52c0 nfssilly nfssilly
0 11 3 0 200 c4ff5560 cachegc cachegc
0 10 3 3 200 c4ff5800 vrele vrele
0 9 3 2 200 c4ff5aa0 vdrain vdrain
0 8 3 0 200 c4ff5d40 modunload mod_unld
0 > 7 7 0 200 c4fed000 xcall/0
0 6 1 0 200 c4fed2a0 softser/0
0 5 1 0 200 c4fed540 softclk/0
0 4 1 0 200 c4fed7e0 softbio/0
0 3 1 0 200 c4feda80 softnet/0
0 2 1 0 201 c4fedd20 idle/0
0 1 3 3 200 c0652400 swapper uvm
db{0}> rev boot 0x04
[-- break #0(1) sent -- `\z' -- Sat Mar 3 10:56:58 2012]
[-- break #0(1) sent -- `\z' -- Sat Mar 3 10:57:02 2012]
[machine completely stuck]
Note the reboot(8) in 'tstile', and the find(1) processes (the
original culprits) in 'vmem'.
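To make the pattern easier to spot in longer listings, the ddb 'ps'
output can be grouped by wait channel. The following is a minimal
sketch, assuming the column layout shown above (PID, LID, S, CPU,
FLAGS, STRUCT LWP *, NAME, WAIT) and names without embedded spaces;
by_wait_channel is a hypothetical helper, not part of any NetBSD
tool. LWPs without a wait channel (idle, zombie, softint threads)
are skipped.

```python
from collections import defaultdict


def by_wait_channel(ddb_ps):
    """Group LWPs from a ddb 'ps' listing by their WAIT column.

    Returns {wait_channel: [(pid, lid, name), ...]} in listing order.
    """
    groups = defaultdict(list)
    for line in ddb_ps.splitlines():
        # Drop the '>' marker ddb prints for LWPs currently on a CPU.
        fields = [f for f in line.split() if f != ">"]
        # Need all eight columns; this also skips the header line and
        # entries with an empty WAIT column.
        if len(fields) < 8 or not fields[0].isdigit():
            continue
        pid, lid, name, wchan = fields[0], fields[1], fields[6], fields[7]
        groups[wchan].append((int(pid), int(lid), name))
    return dict(groups)
```

Applied to the listing above, the 'tstile' group holds the three amd
processes and reboot, and the 'vmem' group holds both find processes
together with ioflush.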
>How-To-Repeat:
Run netbsd-6 on a busy, SCSI RAID based NFS file server.
>Fix:
None I can see.
The machine is easy to upset, so I can quickly provide any
details someone knowledgeable might be interested in, including
ddb dances.
(Re-sent because of botched sender mail address)
>Unformatted: