NetBSD-Bugs archive

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]

kern/51474: unbound can get stuck in fd_close()



>Number:         51474
>Category:       kern
>Synopsis:       unbound can get stuck in fd_close()
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Tue Sep 13 07:15:00 +0000 2016
>Originator:     he%uninett.no@localhost
>Release:        NetBSD 7.0
>Organization:
	UNINETT AS
>Environment:
System: NetBSD osl-res.uninett.no 7.0 NetBSD 7.0 (GENERIC) #0: Wed Sep 30 14:46:48 CEST 2015  he%osl-res.uninett.no@localhost:/usr/obj/sys/arch/amd64/compile/GENERIC amd64
Architecture: x86_64
Machine: amd64
>Description:
	I'm running unbound on one of my machines, and monitor it
	using collectd.  Collectd uses unbound-control periodically to
	collect stats.  It looks as if unbound in certain conditions
	can become stuck in fd_close().  This has several consequences:

	 * unbound can then no longer be stopped with "sh
	   /etc/rc.d/unbound stop" (it just keeps on printing the PID
	   it's waiting for -- unbound's).

	   It also doesn't respond to "kill " (which by default sends
	   TERM).

	 * Manual use of unbound-control hangs.

	Use of "crash" to dig a bit shows at least the kernel part of
	its stack:

crash> ps
PID    LID S CPU     FLAGS       STRUCT LWP *               NAME WAIT
...
1101    12 3   7   1000000   fffffe88714394c0            unbound fdclose
1101    11 3   1        80   fffffe88714398e0            unbound kqueue
1101    10 3   8        80   fffffe88719b3080            unbound kqueue
1101     9 3   5        80   fffffe88719b34a0            unbound kqueue
1101     8 3  15        80   fffffe88719b38c0            unbound kqueue
1101     7 3  10        80   fffffe8872ee6060            unbound kqueue
1101     6 3   9        80   fffffe8872ee6480            unbound kqueue
1101     5 3  14        80   fffffe8872ee68a0            unbound kqueue
1101     4 3   0        80   fffffe88714a9040            unbound kqueue
1101     3 3   7        80   fffffe88714a9460            unbound kqueue
1101     2 3   3        80   fffffe88714a9880            unbound kqueue
1101     1 3  12        80   fffffe8872920020            unbound netio
...
crash> t/t 0t1101
trace: pid 1101 lid 12 at 0xfffffe813ad6dde0
sleepq_block() at sleepq_block+0xa2
cv_wait() at cv_wait+0x9a
fd_close() at fd_close+0x2a2
sys_close() at sys_close+0x1b
syscall() at syscall+0x9a
--- syscall (number 6) ---
7f7ff5e3c62a:
crash> 

load averages:  0.16,  0.19,  0.11;               up 14+15:39:18       09:03:34
79 threads: 76 sleeping, 1 stopped, 2 on CPU
CPU states:  0.3% user,  0.0% nice,  0.8% system,  0.1% interrupt, 98.6% idle
Memory: 7445M Act, 10M Wired, 23M Exec, 883M File, 22G Free
Swap: 32G Total, 32G Free

  PID   LID USERNAME PRI STATE      TIME   WCPU    CPU NAME      COMMAND
 1101     9 unbound   43 CPU/5    441:03  6.93%  6.93% -         unbound
 1101     6 unbound   85 kqueue/9 123:16  2.25%  2.25% -         unbound
 1101     2 unbound   85 kqueue/1 140:56  2.20%  2.20% -         unbound
 1101     7 unbound   85 kqueue/7 100:36  0.29%  0.29% -         unbound
 1101    11 unbound   85 kqueue/1 114:07  0.24%  0.24% -         unbound
 1101     8 unbound   85 kqueue/1 108:10  0.24%  0.24% -         unbound
 1101    10 unbound   85 kqueue/8  88:19  0.20%  0.20% -         unbound
 1101     1 unbound   85 netio/12 102:12  0.00%  0.00% -         unbound
 1101     3 unbound   85 kqueue/6  96:36  0.00%  0.00% -         unbound
 1101     5 unbound   85 kqueue/0  77:38  0.00%  0.00% -         unbound
 1101    12 unbound   85 fdclos/7  70:59  0.00%  0.00% -         unbound
 1101     4 unbound   85 kqueue/2  68:00  0.00%  0.00% -         unbound

	Unbound itself appears to still carry out its main function
	(being a DNS name server), but unbound can now no longer be
	killed or restarted, forcing a reboot of the host OS to
	recover.

	Sending "kill -9" leaves two threads still active, clogging
        up port 53 so unbound can't be restarted without a kernel reboot:

load averages:  0.06,  0.13,  0.10;               up 14+15:43:17       09:07:33
77 threads: 75 sleeping, 1 stopped, 1 on CPU
CPU states:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt, 99.9% idle
Memory: 7449M Act, 10M Wired, 24M Exec, 882M File, 22G Free
Swap: 32G Total, 32G Free

  PID   LID USERNAME PRI STATE      TIME   WCPU    CPU NAME      COMMAND
 1101     6 unbound   85 lwpwai/9 123:19  0.00%  0.00% -         unbound
 1101    12 unbound   95 fdclos/7  70:59  0.00%  0.00% -         unbound

        Output from "fstat -p 1101" after this action is:

osl-res# fstat -p 1101
USER     CMD          PID   FD MOUNT       INUM MODE         SZ|DV R/W
unbound  unbound     1101 root /        11993123 drwxr-xr-x     512 r 
unbound  unbound     1101   wd /        11993123 drwxr-xr-x     512 r 
unbound  unbound     1101    0 /        2482956 crw-rw-rw-    null rw
unbound  unbound     1101    1 /        2482956 crw-rw-rw-    null rw
unbound  unbound     1101    2 /        2482956 crw-rw-rw-    null rw
unbound  unbound     1101    3* crypto 0xfffffe8876dfcf18
unbound  unbound     1101    4* internet dgram udp localhost:domain
unbound  unbound     1101    5* internet stream tcp localhost:domain
unbound  unbound     1101    6* internet6 dgram udp localhost:domain
unbound  unbound     1101    7 flags 0x80034<ISTTY,MPSAFE,LOCKSWORK,CLEAN>
unbound  unbound     1101    8 flags 0x80034<ISTTY,MPSAFE,LOCKSWORK,CLEAN>
unbound  unbound     1101    9* pipe 0xfffffe88767bcdc0 -> 0x0 w
unbound  unbound     1101   10* internet6 stream tcp localhost:domain
unbound  unbound     1101   11* internet dgram udp osl-res.uninett.no:domain
unbound  unbound     1101   12* internet stream tcp osl-res.uninett.no:domain
unbound  unbound     1101   13* internet6 dgram udp [2001:700:0:102::ca53]:domain
unbound  unbound     1101   14* internet6 stream tcp [2001:700:0:102::ca53]:domain
unbound  unbound     1101   15* internet dgram udp dns-resolver1.uninett.no:domain
unbound  unbound     1101   16* internet stream tcp dns-resolver1.uninett.no:domain
unbound  unbound     1101   17* internet6 dgram udp dns-resolver1.uninett.no:domain
unbound  unbound     1101   18* internet6 stream tcp dns-resolver1.uninett.no:domain
unbound  unbound     1101   19* unix stream  <-> /var/run/unbound-ctl [creat]
unbound  unbound     1101   20* unix dgram  <-> /var/run/log [using]
unbound  unbound     1101   21* unix stream  0xfffffe83f73a8328 <-> 0xfffffe83f73a82b8
unbound  unbound     1101   22* unix stream  0xfffffe83f73a82b8 <-> 0xfffffe83f73a8328
unbound  unbound     1101   23* unix stream  0xfffffe83f73a8248 <-> 0xfffffe83f73a81d8
unbound  unbound     1101   24* unix stream  0xfffffe83f73a81d8 <-> 0xfffffe83f73a8248
unbound  unbound     1101   25* unix stream  0xfffffe83f73a8168 <-> 0xfffffe83f73a80f8
unbound  unbound     1101   26* unix stream  0xfffffe83f73a80f8 <-> 0xfffffe83f73a8168
unbound  unbound     1101   27* unix stream  0xfffffe83f73a8088 <-> 0xfffffe83f73a8018
unbound  unbound     1101   28* unix stream  0xfffffe83f73a8018 <-> 0xfffffe83f73a8088
unbound  unbound     1101   29* unix stream  0xfffffe8872f3ef70 <-> 0xfffffe8872f3ef00
unbound  unbound     1101   30* unix stream  0xfffffe8872f3ef00 <-> 0xfffffe8872f3ef70
unbound  unbound     1101   31* unix stream  0xfffffe8872f3ee90 <-> 0xfffffe8872f3ee20
unbound  unbound     1101   32* unix stream  0xfffffe8872f3ee20 <-> 0xfffffe8872f3ee90
unbound  unbound     1101   33* unix stream  0xfffffe8872f3edb0 <-> 0xfffffe8872f3ed40
unbound  unbound     1101   34* unix stream  0xfffffe8872f3ed40 <-> 0xfffffe8872f3edb0
unbound  unbound     1101   35* unix stream  0xfffffe8872f3ecd0 <-> 0xfffffe8872f3ec60
unbound  unbound     1101   36* unix stream  0xfffffe8872f3ec60 <-> 0xfffffe8872f3ecd0
unbound  unbound     1101   37* unix stream  0xfffffe8872f3ebf0 <-> 0xfffffe8872f3eb80
unbound  unbound     1101   38* unix stream  0xfffffe8872f3eb80 <-> 0xfffffe8872f3ebf0
unbound  unbound     1101   39* unix stream  0xfffffe8872f3eb10 <-> 0xfffffe8872f3eaa0
unbound  unbound     1101   40* unix stream  0xfffffe8872f3eaa0 <-> 0xfffffe8872f3eb10
unbound  unbound     1101   41* unix stream  0xfffffe8872f3ea30 <-> 0xfffffe8872f3e9c0
unbound  unbound     1101   42* unix stream  0xfffffe8872f3e9c0 <-> 0xfffffe8872f3ea30
unbound  unbound     1101   43* unix stream  0xfffffe8872f3e950 <-> 0xfffffe8872f3e8e0
unbound  unbound     1101   44* unix stream  0xfffffe8872f3e8e0 <-> 0xfffffe8872f3e950
unbound  unbound     1101   45* kqueue pending 8
unbound  unbound     1101   46* kqueue pending 7
unbound  unbound     1101   47* kqueue pending 13
unbound  unbound     1101   48* kqueue pending 8
unbound  unbound     1101   49* kqueue pending 7
unbound  unbound     1101   50* kqueue pending 9
unbound  unbound     1101   51* kqueue pending 9
unbound  unbound     1101   52* kqueue pending 17
unbound  unbound     1101   53* kqueue pending 10
unbound  unbound     1101   54* kqueue pending 8
unbound  unbound     1101   55* kqueue pending 9
unbound  unbound     1101   56* kqueue pending 9
unbound  unbound     1101   57* internet dgram udp *:35674
unbound  unbound     1101   58* internet dgram udp *:10637
unbound  unbound     1101   59* internet dgram udp *:30490
unbound  unbound     1101   60* internet dgram udp *:19156
unbound  unbound     1101   61* internet dgram udp *:31394
unbound  unbound     1101   62* internet dgram udp *:32203
unbound  unbound     1101   63* internet dgram udp *:63713
unbound  unbound     1101   64* unix stream  <-> /var/run/unbound-ctl [creat]
unbound  unbound     1101   65* internet dgram udp *:36459
unbound  unbound     1101   66* internet dgram udp *:56616
unbound  unbound     1101   67* internet dgram udp *:21417
unbound  unbound     1101   68* internet dgram udp *:57035
unbound  unbound     1101   69* internet dgram udp *:22686
unbound  unbound     1101   70* internet dgram udp *:49669
unbound  unbound     1101   71* internet stream tcp
unbound  unbound     1101   72* internet6 dgram udp *:53378
unbound  unbound     1101   73* internet6 dgram udp *:20212
unbound  unbound     1101   74* internet stream tcp
unbound  unbound     1101   76* internet dgram udp *:9932
unbound  unbound     1101   77* internet dgram udp *:12556
unbound  unbound     1101   78* internet dgram udp *:34919
unbound  unbound     1101   79* internet dgram udp *:64843
unbound  unbound     1101   80* internet dgram udp *:43554
unbound  unbound     1101   81* internet dgram udp *:52510
unbound  unbound     1101   82* internet stream tcp
unbound  unbound     1101   83* internet dgram udp *:43543
unbound  unbound     1101   85* internet6 dgram udp *:18430
unbound  unbound     1101   86* internet dgram udp *:11819
unbound  unbound     1101   87* internet dgram udp *:6392
unbound  unbound     1101   88* internet dgram udp *:62296
unbound  unbound     1101   89* internet dgram udp *:65501
unbound  unbound     1101   90* internet dgram udp *:15730
unbound  unbound     1101   91* internet dgram udp *:8304
unbound  unbound     1101   92* internet dgram udp *:31388
unbound  unbound     1101   93* internet dgram udp *:24339
unbound  unbound     1101   94* internet6 dgram udp *:37859
unbound  unbound     1101   95* internet6 dgram udp *:42976
unbound  unbound     1101   96* internet dgram udp *:53401
unbound  unbound     1101   97* internet dgram udp *:36576
unbound  unbound     1101   98* internet dgram udp *:12451
unbound  unbound     1101  100* internet dgram udp *:26113
unbound  unbound     1101  102* internet dgram udp *:41035
unbound  unbound     1101  114* internet6 dgram udp *:14168


>How-To-Repeat:
	Run unbound on 7.0 with relatively light load (a few hundred
	queries/sec).  Monitor it with collectd, watch it periodically
	get stuck in this manner.

>Fix:
	Sorry, don't know.  Hints for debugging this further
	gratefully accepted.



Home | Main Index | Thread Index | Old Index