Subject: kern/22002: panic: double tcp_freeq() may happen - TAILQ_* error check fails
To: None <gnats-bugs@gnats.netbsd.org>
From: None <he@nordu.net>
List: netbsd-bugs
Date: 06/27/2003 17:27:29
>Number:         22002
>Category:       kern
>Synopsis:       panic: double tcp_freeq() may happen - TAILQ_* error check fails
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Fri Jun 27 15:28:00 UTC 2003
>Closed-Date:
>Last-Modified:
>Originator:     Havard Eidnes
>Release:        NetBSD 1.6.1_RC1 + stability workarounds for PR#20390
>Organization:
	NORDUnet
>Environment:
System: NetBSD black-3.sth.netnod.se 1.6.1_RC1 NetBSD 1.6.1_RC1 (BLACKHOLE) #30: Fri Mar 14 20:34:33 MET 2003     he@black-3.sth.netnod.se:/usr/src/sys/arch/i386/compile/BLACKHOLE i386
Architecture: i386
Machine: i386
>Description:
	This system which sees widespread and a relatively high load
	of TCP connections has been running stable for 3 months, and
	then decided to panic due to an internal consistency check in
	the TAILQ_* macros.

	Details from the debugging sessions follows:

panic: TAILQ_* back 0xc0867b04 /sys/arch/i386/compile/BLACKHOLE/../../../../netinet/tcp_subr.c:1181
Stopped in pid 115 (named) at   cpu_Debugger+0x4:       leave
db> 
db> 
db> trace
cpu_Debugger(c0867b04,c089f164,f000efe6,d2ec0b6c,c018045d) at cpu_Debugger+0x4
panic(c0266a76,c0867b04,c0266a20,49d,c089f164) at panic+0xad
tcp_freeq(c089f164,c0000000,a1dd,c0dc1500) at tcp_freeq+0x50
tcp_drain(d000ef64,c02c8520,c02c88c0,c0814690,3ef46298) at tcp_drain+0x90
m_reclaim(0,0,d2ec0cdc,c0da8000,d000ef64) at m_reclaim+0x39
pool_get(c02c88c0,0,c0dc1500,d2ec0c4c,d000ef64) at pool_get+0x27
pool_cache_get(c02c8520,0,c0d2bb00,0,800) at pool_cache_get+0x3b
fxp_add_rfabuf(c079a000,c07ac9c0,1,4) at fxp_add_rfabuf+0xda
fxp_rxintr(c079a000,0,0,a7e90210) at fxp_rxintr+0x58f
fxp_intr(c079a000) at fxp_intr+0xca
Xintr10() at Xintr10+0x82
--- interrupt ---
pool_do_put(c02ca700,c089f164,c0b796c8,c02cfde0,c0000000) at pool_do_put+0x35e
pool_put(c02ca700,c089f164,d2ec0dd0,c0158bee) at pool_put+0x14
tcp_close(c089f164,0,c01c051c,c089f164,130afc0) at tcp_close+0x29a
tcp_input(c09e6000,14,6,1,c09e6000) at tcp_input+0x25bb
ip_input(c09e6000,0,bfbfda9c,0) at ip_input+0x5f2
ipintr(1f,1f,1f,1f,bfbfc07c) at ipintr+0x6b
Xsoftnet() at Xsoftnet+0x2c
--- interrupt ---
0x480eeec0:
db>  
db> ps
 PID             PPID       PGRP        UID S   FLAGS          COMMAND    WAIT
 236                1        236          0 3  0x4086            getty   ttyin
 235                1        235          0 3  0x4086            getty   ttyin
 234                1        234          0 3  0x4086            getty   ttyin
 233                1        233          0 3  0x4086            getty   ttyin
 231                1        231          0 3    0x84             cron nanosle
 215                1        215          0 3    0x84         sendmail   pause
 197                1        197          0 3    0x84             sshd  select
 171                1        171          0 3    0x84             bgpd  select
 169                1        169          0 3    0x84            zebra  select
*115                1        115          0 7     0x4            named
 104                1        104          0 3    0x84          syslogd  select
 9                  0          0          0 3 0x20204         aiodoned aiodone
 8                  0          0          0 3 0x20204          ioflush  syncer
 7                  0          0          0 3 0x20204           reaper  reaper
 6                  0          0          0 3 0x20204       pagedaemon pgdaemo
 5                  0          0          0 3 0x20204             usb1  usbevt
 4                  0          0          0 3 0x20204          usbtask  usbtsk
 3                  0          0          0 3 0x20204             usb0  usbevt
 2                  0          0          0 3 0x20204        atapibus0  sccomp
 1                  0          1          0 3  0x4084             init    wait
 0                 -1          0          0 3 0x20204          swapper schedul
db>  
db> x/x 0xc0867b04
0xc0867b04:     c089f164
db> x/x,20
0xc0867b04:     c089f164    c089f164    0           c08b7480    b           c089
f000    3ef40783    dbba0       c0867b74    c0867a5c    0           c0867b2c
0xc0867b34:     c086755c    c0867f44    10          c08a1000    0           0
0xc0867b4c:     c092f028    c0a70914    c08a28d0    c08a2658    c0867584    c08b
7188    0           c08a2000    3ef45737    c5c10       c0867b9c    c0867b24
0xc0867b7c:     0           c0867b7c
db> reboot 104
Dumping etc. etc.

black-3# gdb netbsd.gdb
GNU gdb 5.0nb1
Copyright 2000 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i386--netbsdelf"...
(gdb) target kcore /var/crash/netbsd.22.core
panic: TAILQ_* back %p %s:%d
#0  0x1 in ?? ()
(gdb) where
#0  0x1 in ?? ()
#1  0xc01f4c17 in cpu_reboot (howto=260, bootstr=0x0)
    at /sys/arch/i386/compile/BLACKHOLE/../../../../arch/i386/i386/machdep.c:223
6
#2  0xc0148a11 in db_reboot_cmd ()
    at /sys/arch/i386/compile/BLACKHOLE/../../../../ddb/db_command.c:669
#3  0xc01486ec in db_command (last_cmdp=0xc02a3634, cmd_table=0xc025cd00)
    at /sys/arch/i386/compile/BLACKHOLE/../../../../ddb/db_command.c:456
#4  0xc01482eb in db_command_loop ()
    at /sys/arch/i386/compile/BLACKHOLE/../../../../ddb/db_command.c:246
#5  0xc014bdc4 in db_trap (type=1, code=0)
    at /sys/arch/i386/compile/BLACKHOLE/../../../../ddb/db_trap.c:92
#6  0xc01f1b7e in kdb_trap (type=1, code=0, regs=0xd2ec0ad8)
    at /sys/arch/i386/compile/BLACKHOLE/../../../../arch/i386/i386/db_interface.
c:129
#7  0xc01f9bb3 in trap (frame={tf_gs = 16, tf_fs = 16, tf_es = 16, tf_ds = 16, 
      tf_edi = 256, tf_esi = -1071224202, tf_ebp = -756282600, 
      tf_ebx = -756282556, tf_edx = -1071246029, tf_ecx = 26240, 
      tf_eax = 3453, tf_trapno = 1, tf_err = 0, tf_eip = -1071702972, 
      tf_cs = 8, tf_eflags = 514, tf_esp = -756282568, tf_ss = -1072245431, 
      tf_vm86_es = -1064928508, tf_vm86_ds = -1064701596, 
      tf_vm86_fs = -268374042, tf_vm86_gs = -756282516})
    at /sys/arch/i386/compile/BLACKHOLE/../../../../arch/i386/i386/trap.c:220
---Type <return> to continue, or q <return> to quit---
#8  0xc0100bfb in calltrap ()
#9  0xc016d549 in panic (fmt=0xc0266a76 "TAILQ_* back %p %s:%d")
    at /sys/arch/i386/compile/BLACKHOLE/../../../../kern/subr_prf.c:237
#10 0xc01bf150 in tcp_freeq (tp=0xc089f164)
    at /sys/arch/i386/compile/BLACKHOLE/../../../../netinet/tcp_subr.c:1181
#11 0xc01bf2a0 in tcp_drain ()
    at /sys/arch/i386/compile/BLACKHOLE/../../../../netinet/tcp_subr.c:1213
#12 0xc017b01d in m_reclaim (arg=0x0, flags=0)
    at /sys/arch/i386/compile/BLACKHOLE/../../../../kern/uipc_mbuf.c:258
#13 0xc016b9f7 in pool_get (pp=0xc02c88c0, flags=0)
    at /sys/arch/i386/compile/BLACKHOLE/../../../../kern/subr_pool.c:672
#14 0xc016ce77 in pool_cache_get (pc=0xc02c8520, flags=0)
    at /sys/arch/i386/compile/BLACKHOLE/../../../../kern/subr_pool.c:1683
#15 0xc010a0be in fxp_add_rfabuf (sc=0xc079a000, rxmap=0xc07ac9c0, unload=1)
    at /sys/arch/i386/compile/BLACKHOLE/../../../../dev/ic/i82557.c:1824
#16 0xc0108c0f in fxp_rxintr (sc=0xc079a000)
    at /sys/arch/i386/compile/BLACKHOLE/../../../../dev/ic/i82557.c:1205
#17 0xc0108462 in fxp_intr (arg=0xc079a000)
    at /sys/arch/i386/compile/BLACKHOLE/../../../../dev/ic/i82557.c:1037
#18 0xc0101799 in Xintr10 ()
#19 0xc0000000 in ?? ()
#20 0xc070e780 in ?? ()
can not access 0x3, invalid translation (invalid PDE)
can not access 0x3, invalid translation (invalid PDE)
Cannot access memory at address 0x3
(gdb) up 10
#10 0xc01bf150 in tcp_freeq (tp=0xc089f164)
    at /sys/arch/i386/compile/BLACKHOLE/../../../../netinet/tcp_subr.c:1181
1181                    TAILQ_REMOVE(&tp->segq, qe, ipqe_q);
(gdb) p tp
$1 = (struct tcpcb *) 0xc089f164
(gdb) p $.segq.tqh_first
$3 = (struct ipqent *) 0xc0867b04
(gdb) p *$
$4 = {ipqe_q = {tqe_next = 0xc089f164, tqe_prev = 0xc089f164}, _ipqe_u1 = {
    _ip = 0x0, _tcp = 0x0}, ipqe_m = 0xc08b7480, ipre_mlast = 0xb, 
  ipqe_mff = 0 '\000', ipqe_timeq = {tqe_next = 0x3ef40783, 
    tqe_prev = 0xdbba0}, ipqe_seq = 3230038900, ipqe_len = 3230038620, 
  ipqe_flags = 0}
(gdb) p/x tp->t_flags
$8 = 0x4000
(gdb) p qe
$10 = (struct ipqent *) 0xc0867b04
(gdb)

	It seems that the QUEUEDEBUG_TAILQ_OP check in TAILQ_REMOVE is
	"correct" in that something has messed up this particular
	queue, because the "prev" pointer for the next element doesn't
	point back towards the element at the head of the queue.

	...

	However, note that 0xc089f164 as found in qe->tqe_next in
	frame 10 is the TCPCB which is being closed when the fxp
	interrupt arrives (according to the DDB output; gdb really
	should know how to trace through an interrupt frame...), and
	it appears also to be the argument to the tcp_freeq() function
	called from tcp_drain().  However, at this particular point in
	tcp_close():

(gdb) x/i tcp_close+0x29a
0xc01bf09a <tcp_close+666>:     add    $0x10,%esp
(gdb) i li *0xc01bf09a
Line 1145 of "/sys/arch/i386/compile/BLACKHOLE/../../../../netinet/tcp_subr.c"
   starts at address 0xc01bf08c <tcp_close+652>
   and ends at 0xc01bf09d <tcp_close+669>.
(gdb) l 1132
1127            }
1128    #endif /* RTV_RTT */
1129            /* free the reassembly queue, if any */
1130            TCP_REASS_LOCK(tp);
1131            (void) tcp_freeq(tp);
1132            TCP_REASS_UNLOCK(tp);
1133    
1134            tcp_canceltimers(tp);
1135            TCP_CLEAR_DELACK(tp);
1136            syn_cache_cleanup(tp);
(gdb) l
1137    
1138            if (tp->t_template) {
1139                    m_free(tp->t_template);
1140                    tp->t_template = NULL;
1141            }
1142            if (tcp_timers_running(tp))
1143                    tp->t_flags |= TF_DEAD;
1144            else
1145                    pool_put(&tcpcb_pool, tp);
1146    
(gdb) 

	...tcp_freeq() has already been called with this particular
	TCPCB as an argument.  It is perhaps not surprising that a
	second attempt to free the same data structure will fail,
	although I don't claim to understand why the data structure
	isn't marked as "empty" (TAILQ_FIRST(&tp->segq)) == NULL) when
	the data structure has been released.

	This does appear to happen only once every blue moon; at the
	time of panic the system had had over 3 months uptime.

>How-To-Repeat:
	Reproducing this problem is going to be somewhat difficult,
	however, I think the debugging session above should be
	sufficient for someone knowledgeable to come up with a
	suitable fix(?)

>Fix:
	Explicitly mark queues as empty when they have been released
	in tcp_freeq()?
>Release-Note:
>Audit-Trail:
>Unformatted: