NetBSD-Bugs archive


kern/46136: processes get stuck in D under high I/O load



>Number:         46136
>Category:       kern
>Synopsis:       processes get stuck in D under high I/O load
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sat Mar 03 15:10:00 +0000 2012
>Originator:     Hauke Fath
>Release:        NetBSD 6.0_BETA
>Organization:
TU Darmstadt
>Environment:
System: NetBSD venediger 6.0_BETA NetBSD 6.0_BETA (VENEDIGER) #0: Thu Mar 1 18:10:56 CET 2012 hf@Hochstuhl:/var/obj/netbsd-builds/6/i386/sys/arch/i386/compile/VENEDIGER i386
Architecture: i386
Machine: i386
>Description:

        We run an i386 machine equipped with a Super Micro X7SBE
        (4-core Xeon) and a SCSI MegaRAID 320-4X as a file server,
        mainly for NFS.

        When we switched the RAID controller from a 320-2 to said
        320-4X under netbsd-5, nfsd developed a tendency to get
        stuck in 'D' state every other day, making a reboot necessary.

        After upgrading to netbsd-6 and tuning buffer and pool sizes,
        the nfsd problem is somewhat mitigated, although a
        string-and-duct-tape script is still in place that checks
        whether nfsd has been stuck in 'D' for an extended period of
        time and, if so, reboots the machine.
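
        For illustration only, a watchdog of that kind could look
        roughly like the sketch below. It is not the script actually
        in use: the function names, the ps(1) column selection
        (pid,state,comm), the 10-minute limit, the 60-second polling
        interval and the shutdown(8) call are all assumptions made
        for this example.

```python
import subprocess
import time


def stuck_pids(ps_output, command="nfsd"):
    """Return the PIDs of `command` whose state starts with 'D'.

    Expects output in the shape of `ps -axo pid,state,comm`
    (one header line, then one process per line).
    """
    pids = set()
    for line in ps_output.splitlines()[1:]:  # skip the header line
        fields = line.split(None, 2)
        if len(fields) < 3:
            continue
        pid, state, comm = fields
        if comm.strip() == command and state.startswith("D"):
            pids.add(int(pid))
    return pids


def update_stuck_since(stuck_since, pids, now):
    """Remember when each PID was first seen in 'D'; drop PIDs that recovered."""
    return {pid: stuck_since.get(pid, now) for pid in pids}


def should_reboot(stuck_since, now, limit):
    """True if any tracked PID has been in 'D' for at least `limit` seconds."""
    return any(now - first_seen >= limit for first_seen in stuck_since.values())


def main(limit=600, interval=60):
    """Poll ps(1) and reboot once nfsd has stayed in 'D' for too long.

    The limit and interval are hypothetical values, not taken from the report.
    """
    stuck_since = {}
    while True:
        out = subprocess.run(
            ["ps", "-axo", "pid,state,comm"],
            capture_output=True, text=True, check=True,
        ).stdout
        now = time.monotonic()
        stuck_since = update_stuck_since(stuck_since, stuck_pids(out), now)
        if should_reboot(stuck_since, now, limit):
            subprocess.run(["shutdown", "-r", "now"], check=False)
            return
        time.sleep(interval)
```

        Calling main() starts the polling loop. A process that is
        only briefly in 'D' does not trigger a reboot, because its
        entry is dropped as soon as one sample sees it in another
        state.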

        Unfortunately, the jobs started from /etc/daily get stuck
        too, and wedge the machine so badly that even a 'reboot 0x04'
        from the debugger has no effect, and a hard reset is needed.

        Console log of the shutdown attempt, followed by the
        debugger 'ps' output:

[...]
About to run shutdown hooks...
Stopping cron.
Waiting for PIDS: 826.
Stopping inetd.
Waiting for PIDS: 302.
Saved entropy to disk.
Turning off accounting.
Removing block-type swap devices
swapctl: removing /dev/ld0b as swap device
Sat Mar  3 10:50:53 CET 2012

Done running shutdown hooks.
Mar  3 10:50:59 venediger syslogd[184]: Exiting on signal 15
syncing disks... 3 done
[-- break #0(1) sent -- `\z' -- Sat Mar  3 10:53:16 2012]
fatal breakpoint trap in supervisor mode
trap type 1 code 0 eip c0183c64 cs 8 eflags 200286 cr2 bb688b04 ilevel 8
Stopped in pid 0.7 (system) at  netbsd:breakpoint+0x4:  popl    %ebp
db{0}> ps
PID    LID S CPU     FLAGS       STRUCT LWP *               NAME WAIT
9408     1 3   0   9020000           c5ded800                amd tstile
16127    1 3   3   9020000           c5dedd40                amd tstile
545      1 3   1   9020000           c80a5000                amd tstile
17941    1 3   2         0           cc47bd40             reboot tstile
29808    1 3   3   9020000           cd494d20               find vmem
28944    1 3   0   9020000           c8a86560               find vmem
1        1 3   3   8020080           c5d78aa0               init wait
0       78 3   3       200           c538e020              nfsio nfsiod
0       77 3   2       200           c538e2c0              nfsio nfsiod
0       76 3   1       200           c538e560              nfsio nfsiod
0       75 3   2       200           c5ded560              nfsio nfsiod
0       74 5   3       200           c5e34000           (zombie)
0       73 3   3       200           c5ded020            physiod physiod
0       72 3   3       200           c5dc5d20           aiodoned aiodoned
0       71 3   2       200           c5d782c0            ioflush vmem
0       70 3   1       200           c5d78020           pgdaemon xclocv
0       67 3   3       200           c5d3b800          cryptoret crypto_w
0       66 3   3       200           c5d78560          atapibus0 sccomp
0       64 3   2       200           c5d25540               usb4 usbevt
0       63 3   0       200           c5d3b2c0               usb7 usbevt
0       62 3   3       200           c5d3b560               usb6 usbevt
0       61 3   1       200           c5d3baa0               usb5 usbevt
0       60 3   3       200           c5d78800               usb3 usbevt
0       59 3   3       200           c5d252a0              unpgc unpgc
0       58 3   0       200           c5d3bd40               usb0 usbevt
0       57 3   0       200           c5d25000               usb2 usbevt
0       56 3   2       200           c5d78d40         usbtask-dr usbtsk
0       55 3   3       200           c5d3c000         usbtask-hc usbtsk
0       54 3   3       200           c5d3c2a0               usb1 usbevt
0       53 3   0       200           c5d3c540        vmem_rehash vmem_rehash
0       52 3   0       200           c5d3c7e0          coretemp3 coretemp3
0       51 3   3       200           c5d3ca80          coretemp2 coretemp2
0       50 3   1       200           c5d3cd20          coretemp1 coretemp1
0       49 3   2       200           c5d3b020          coretemp0 coretemp0
0       40 3   2       200           c5d257e0            atabus3 atath
0       39 3   0       200           c5d25a80            atabus2 atath
0       38 3   3       200           c5d25d20               iic0 iicintr
0       37 3   2       200           c5b29020            atabus1 atath
0       36 3   0       200           c5b292c0            atabus0 atath
0       35 3   0       200           c5b29560               apm0 apmev
0       34 3   3       200           c5b29800            xcall/3 xcall
0       33 1   3       200           c5b29aa0          softser/3
0       32 1   3       200           c5b29d40          softclk/3
0       31 1   3       200           c5b1e000          softbio/3
0       30 1   3       200           c5b1e2a0          softnet/3
0    >  29 7   3       201           c5b1e540             idle/3
0       28 3   2       200           c5b1e7e0            xcall/2 xcall
0       27 1   2       200           c5b1ea80          softser/2
0       26 1   2       200           c5b1ed20          softclk/2
0       25 1   2       200           c5b1a020          softbio/2
0       24 1   2       200           c5b1a2c0          softnet/2
0    >  23 7   2       201           c5b1a560             idle/2
0       22 3   1       200           c5b1a800            xcall/1 xcall
0       21 1   1       200           c5b1aaa0          softser/1
0       20 1   1       200           c5b1ad40          softclk/1
0       19 1   1       200           c4ffb000          softbio/1
0       18 1   1       200           c4ffb2a0          softnet/1
0    >  17 7   1       201           c4ffb540             idle/1
0       16 3   0       200           c4ffb7e0             sysmon smtaskq
0       15 3   0       200           c4ffba80         pmfsuspend pmfsuspend
0       14 3   0       200           c4ffbd20           pmfevent pmfevent
0       13 3   3       200           c4ff5020         sopendfree sopendfr
0       12 3   0       200           c4ff52c0           nfssilly nfssilly
0       11 3   0       200           c4ff5560            cachegc cachegc
0       10 3   3       200           c4ff5800              vrele vrele
0        9 3   2       200           c4ff5aa0             vdrain vdrain
0        8 3   0       200           c4ff5d40          modunload mod_unld
0    >   7 7   0       200           c4fed000            xcall/0
0        6 1   0       200           c4fed2a0          softser/0
0        5 1   0       200           c4fed540          softclk/0
0        4 1   0       200           c4fed7e0          softbio/0
0        3 1   0       200           c4feda80          softnet/0
0        2 1   0       201           c4fedd20             idle/0
0        1 3   3       200           c0652400            swapper uvm
db{0}> rev boot 0x04
[-- break #0(1) sent -- `\z' -- Sat Mar  3 10:56:58 2012]
[-- break #0(1) sent -- `\z' -- Sat Mar  3 10:57:02 2012]

        [machine completely stuck]


        Note the reboot(8) in 'tstile', and the find(1) processes (the
        original culprits) in 'vmem'.
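
        To make that grouping easier to see in a long listing, the
        captured 'ps' output can be tallied by wait channel. The
        following is a minimal sketch, assuming the column layout
        shown above (PID, LID, S, CPU, FLAGS, LWP address, NAME,
        WAIT) and LWP names without embedded spaces; the function
        name is made up for this example.

```python
from collections import defaultdict


def by_wait_channel(ddb_ps):
    """Group LWP names from ddb 'ps' output by their wait channel.

    Rows without a WAIT column (idle, softint, zombie LWPs) and the
    header line are skipped.
    """
    groups = defaultdict(list)
    for line in ddb_ps.splitlines():
        # Drop the '>' marker ddb prints in front of some LIDs.
        fields = [f for f in line.split() if f != ">"]
        if len(fields) != 8 or not fields[0].isdigit():
            continue
        name, wait = fields[6], fields[7]
        groups[wait].append(name)
    return dict(groups)
```

        Fed the listing above, the 'tstile' and 'vmem' entries
        collect the LWPs of interest in one place each.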

>How-To-Repeat:

        Run netbsd-6 on a busy, SCSI-RAID-based NFS file server.
 
>Fix:
        
        None I can see. 

        The machine is easy to upset, so I can quickly provide any
        details someone knowledgeable might be interested in, including
        ddb dances.

        (Re-sent because of botched sender mail address)

>Unformatted:
 

