Subject: kern/22002: panic: double tcp_freeq() may happen - TAILQ_* error check fails
To: None <gnats-bugs@gnats.netbsd.org>
From: None <he@nordu.net>
List: netbsd-bugs
Date: 06/27/2003 17:27:29
>Number: 22002
>Category: kern
>Synopsis: panic: double tcp_freeq() may happen - TAILQ_* error check fails
>Confidential: no
>Severity: serious
>Priority: medium
>Responsible: kern-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Fri Jun 27 15:28:00 UTC 2003
>Closed-Date:
>Last-Modified:
>Originator: Havard Eidnes
>Release: NetBSD 1.6.1_RC1 + stability workarounds for PR#20390
>Organization:
NORDUnet
>Environment:
System: NetBSD black-3.sth.netnod.se 1.6.1_RC1 NetBSD 1.6.1_RC1 (BLACKHOLE) #30: Fri Mar 14 20:34:33 MET 2003 he@black-3.sth.netnod.se:/usr/src/sys/arch/i386/compile/BLACKHOLE i386
Architecture: i386
Machine: i386
>Description:
This system which sees widespread and a relatively high load
of TCP connections has been running stable for 3 months, and
then decided to panic due to an internal consistency check in
the TAILQ_* macros.
Details from the debugging sessions follows:
panic: TAILQ_* back 0xc0867b04 /sys/arch/i386/compile/BLACKHOLE/../../../../netinet/tcp_subr.c:1181
Stopped in pid 115 (named) at cpu_Debugger+0x4: leave
db>
db>
db> trace
cpu_Debugger(c0867b04,c089f164,f000efe6,d2ec0b6c,c018045d) at cpu_Debugger+0x4
panic(c0266a76,c0867b04,c0266a20,49d,c089f164) at panic+0xad
tcp_freeq(c089f164,c0000000,a1dd,c0dc1500) at tcp_freeq+0x50
tcp_drain(d000ef64,c02c8520,c02c88c0,c0814690,3ef46298) at tcp_drain+0x90
m_reclaim(0,0,d2ec0cdc,c0da8000,d000ef64) at m_reclaim+0x39
pool_get(c02c88c0,0,c0dc1500,d2ec0c4c,d000ef64) at pool_get+0x27
pool_cache_get(c02c8520,0,c0d2bb00,0,800) at pool_cache_get+0x3b
fxp_add_rfabuf(c079a000,c07ac9c0,1,4) at fxp_add_rfabuf+0xda
fxp_rxintr(c079a000,0,0,a7e90210) at fxp_rxintr+0x58f
fxp_intr(c079a000) at fxp_intr+0xca
Xintr10() at Xintr10+0x82
--- interrupt ---
pool_do_put(c02ca700,c089f164,c0b796c8,c02cfde0,c0000000) at pool_do_put+0x35e
pool_put(c02ca700,c089f164,d2ec0dd0,c0158bee) at pool_put+0x14
tcp_close(c089f164,0,c01c051c,c089f164,130afc0) at tcp_close+0x29a
tcp_input(c09e6000,14,6,1,c09e6000) at tcp_input+0x25bb
ip_input(c09e6000,0,bfbfda9c,0) at ip_input+0x5f2
ipintr(1f,1f,1f,1f,bfbfc07c) at ipintr+0x6b
Xsoftnet() at Xsoftnet+0x2c
--- interrupt ---
0x480eeec0:
db>
db> ps
PID PPID PGRP UID S FLAGS COMMAND WAIT
236 1 236 0 3 0x4086 getty ttyin
235 1 235 0 3 0x4086 getty ttyin
234 1 234 0 3 0x4086 getty ttyin
233 1 233 0 3 0x4086 getty ttyin
231 1 231 0 3 0x84 cron nanosle
215 1 215 0 3 0x84 sendmail pause
197 1 197 0 3 0x84 sshd select
171 1 171 0 3 0x84 bgpd select
169 1 169 0 3 0x84 zebra select
*115 1 115 0 7 0x4 named
104 1 104 0 3 0x84 syslogd select
9 0 0 0 3 0x20204 aiodoned aiodone
8 0 0 0 3 0x20204 ioflush syncer
7 0 0 0 3 0x20204 reaper reaper
6 0 0 0 3 0x20204 pagedaemon pgdaemo
5 0 0 0 3 0x20204 usb1 usbevt
4 0 0 0 3 0x20204 usbtask usbtsk
3 0 0 0 3 0x20204 usb0 usbevt
2 0 0 0 3 0x20204 atapibus0 sccomp
1 0 1 0 3 0x4084 init wait
0 -1 0 0 3 0x20204 swapper schedul
db>
db> x/x 0xc0867b04
0xc0867b04: c089f164
db> x/x,20
0xc0867b04: c089f164 c089f164 0 c08b7480 b c089
f000 3ef40783 dbba0 c0867b74 c0867a5c 0 c0867b2c
0xc0867b34: c086755c c0867f44 10 c08a1000 0 0
0xc0867b4c: c092f028 c0a70914 c08a28d0 c08a2658 c0867584 c08b
7188 0 c08a2000 3ef45737 c5c10 c0867b9c c0867b24
0xc0867b7c: 0 c0867b7c
db> reboot 104
Dumping etc. etc.
black-3# gdb netbsd.gdb
GNU gdb 5.0nb1
Copyright 2000 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "i386--netbsdelf"...
(gdb) target kcore /var/crash/netbsd.22.core
panic: TAILQ_* back %p %s:%d
#0 0x1 in ?? ()
(gdb) where
#0 0x1 in ?? ()
#1 0xc01f4c17 in cpu_reboot (howto=260, bootstr=0x0)
at /sys/arch/i386/compile/BLACKHOLE/../../../../arch/i386/i386/machdep.c:223
6
#2 0xc0148a11 in db_reboot_cmd ()
at /sys/arch/i386/compile/BLACKHOLE/../../../../ddb/db_command.c:669
#3 0xc01486ec in db_command (last_cmdp=0xc02a3634, cmd_table=0xc025cd00)
at /sys/arch/i386/compile/BLACKHOLE/../../../../ddb/db_command.c:456
#4 0xc01482eb in db_command_loop ()
at /sys/arch/i386/compile/BLACKHOLE/../../../../ddb/db_command.c:246
#5 0xc014bdc4 in db_trap (type=1, code=0)
at /sys/arch/i386/compile/BLACKHOLE/../../../../ddb/db_trap.c:92
#6 0xc01f1b7e in kdb_trap (type=1, code=0, regs=0xd2ec0ad8)
at /sys/arch/i386/compile/BLACKHOLE/../../../../arch/i386/i386/db_interface.
c:129
#7 0xc01f9bb3 in trap (frame={tf_gs = 16, tf_fs = 16, tf_es = 16, tf_ds = 16,
tf_edi = 256, tf_esi = -1071224202, tf_ebp = -756282600,
tf_ebx = -756282556, tf_edx = -1071246029, tf_ecx = 26240,
tf_eax = 3453, tf_trapno = 1, tf_err = 0, tf_eip = -1071702972,
tf_cs = 8, tf_eflags = 514, tf_esp = -756282568, tf_ss = -1072245431,
tf_vm86_es = -1064928508, tf_vm86_ds = -1064701596,
tf_vm86_fs = -268374042, tf_vm86_gs = -756282516})
at /sys/arch/i386/compile/BLACKHOLE/../../../../arch/i386/i386/trap.c:220
---Type <return> to continue, or q <return> to quit---
#8 0xc0100bfb in calltrap ()
#9 0xc016d549 in panic (fmt=0xc0266a76 "TAILQ_* back %p %s:%d")
at /sys/arch/i386/compile/BLACKHOLE/../../../../kern/subr_prf.c:237
#10 0xc01bf150 in tcp_freeq (tp=0xc089f164)
at /sys/arch/i386/compile/BLACKHOLE/../../../../netinet/tcp_subr.c:1181
#11 0xc01bf2a0 in tcp_drain ()
at /sys/arch/i386/compile/BLACKHOLE/../../../../netinet/tcp_subr.c:1213
#12 0xc017b01d in m_reclaim (arg=0x0, flags=0)
at /sys/arch/i386/compile/BLACKHOLE/../../../../kern/uipc_mbuf.c:258
#13 0xc016b9f7 in pool_get (pp=0xc02c88c0, flags=0)
at /sys/arch/i386/compile/BLACKHOLE/../../../../kern/subr_pool.c:672
#14 0xc016ce77 in pool_cache_get (pc=0xc02c8520, flags=0)
at /sys/arch/i386/compile/BLACKHOLE/../../../../kern/subr_pool.c:1683
#15 0xc010a0be in fxp_add_rfabuf (sc=0xc079a000, rxmap=0xc07ac9c0, unload=1)
at /sys/arch/i386/compile/BLACKHOLE/../../../../dev/ic/i82557.c:1824
#16 0xc0108c0f in fxp_rxintr (sc=0xc079a000)
at /sys/arch/i386/compile/BLACKHOLE/../../../../dev/ic/i82557.c:1205
#17 0xc0108462 in fxp_intr (arg=0xc079a000)
at /sys/arch/i386/compile/BLACKHOLE/../../../../dev/ic/i82557.c:1037
#18 0xc0101799 in Xintr10 ()
#19 0xc0000000 in ?? ()
#20 0xc070e780 in ?? ()
can not access 0x3, invalid translation (invalid PDE)
can not access 0x3, invalid translation (invalid PDE)
Cannot access memory at address 0x3
(gdb) up 10
#10 0xc01bf150 in tcp_freeq (tp=0xc089f164)
at /sys/arch/i386/compile/BLACKHOLE/../../../../netinet/tcp_subr.c:1181
1181 TAILQ_REMOVE(&tp->segq, qe, ipqe_q);
(gdb) p tp
$1 = (struct tcpcb *) 0xc089f164
(gdb) p $.segq.tqh_first
$3 = (struct ipqent *) 0xc0867b04
(gdb) p *$
$4 = {ipqe_q = {tqe_next = 0xc089f164, tqe_prev = 0xc089f164}, _ipqe_u1 = {
_ip = 0x0, _tcp = 0x0}, ipqe_m = 0xc08b7480, ipre_mlast = 0xb,
ipqe_mff = 0 '\000', ipqe_timeq = {tqe_next = 0x3ef40783,
tqe_prev = 0xdbba0}, ipqe_seq = 3230038900, ipqe_len = 3230038620,
ipqe_flags = 0}
(gdb) p/x tp->t_flags
$8 = 0x4000
(gdb) p qe
$10 = (struct ipqent *) 0xc0867b04
(gdb)
It seems that the QUEUEDEBUG_TAILQ_OP check in TAILQ_REMOVE is
"correct" in that something has messed up this particular
queue, because the "prev" pointer for the next element doesn't
point back towards the element at the head of the queue.
...
However, note that 0xc089f164 as found in qe->tqe_next in
frame 10 is the TCPCB which is being closed when the fxp
interrupt arrives (according to the DDB output; gdb really
should know how to trace through an interrupt frame...), and
it appears also to be the argument to the tcp_freeq() function
called from tcp_drain(). However, at this particular point in
tcp_close():
(gdb) x/i tcp_close+0x29a
0xc01bf09a <tcp_close+666>: add $0x10,%esp
(gdb) i li *0xc01bf09a
Line 1145 of "/sys/arch/i386/compile/BLACKHOLE/../../../../netinet/tcp_subr.c"
starts at address 0xc01bf08c <tcp_close+652>
and ends at 0xc01bf09d <tcp_close+669>.
(gdb) l 1132
1127 }
1128 #endif /* RTV_RTT */
1129 /* free the reassembly queue, if any */
1130 TCP_REASS_LOCK(tp);
1131 (void) tcp_freeq(tp);
1132 TCP_REASS_UNLOCK(tp);
1133
1134 tcp_canceltimers(tp);
1135 TCP_CLEAR_DELACK(tp);
1136 syn_cache_cleanup(tp);
(gdb) l
1137
1138 if (tp->t_template) {
1139 m_free(tp->t_template);
1140 tp->t_template = NULL;
1141 }
1142 if (tcp_timers_running(tp))
1143 tp->t_flags |= TF_DEAD;
1144 else
1145 pool_put(&tcpcb_pool, tp);
1146
(gdb)
...tcp_freeq() has already been called with this particular
TCPCB as an argument. It is perhaps not surprising that a
second attempt to free the same data structure will fail,
although I don't claim to understand why the data structure
isn't marked as "empty" (TAILQ_FIRST(&tp->segq)) == NULL) when
the data structure has been released.
This does appear to happen only once every blue moon; at the
time of panic the system had had over 3 months uptime.
>How-To-Repeat:
Reproducing this problem is going to be somewhat difficult,
however, I think the debugging session above should be
sufficient for someone knowledgeable to come up with a
suitable fix(?)
>Fix:
Explicitly mark queues as empty when they have been released
in tcp_freeq()?
>Release-Note:
>Audit-Trail:
>Unformatted: