Subject: Odd crashes in tcp_output (2.0ish)
To: None <tech-net@netbsd.org>
From: Greg Troxel <gdt@ir.bbn.com>
List: tech-net
Date: 01/26/2005 11:43:27
I'm running 2.0_RC4 on i386, and have made a number of changes (added
new AF, wireless driver changes), but the machine in question isn't
doing any of that.  It is a mail and dns server, and has lots of smtp
incoming connections, many of which are throttled by postfix (delaying
answers) or closed due to user unknown in local recipient table.

At least three times, the system has crashed due to callout asserting
from tcp_output.  The other backtraces look similar.

(gdb) bt
#0  0x1fd07000 in ?? ()
#1  0xc037e25b in cpu_reboot (howto=256, bootstr=0x0)
    at /usr/home/gdt/SINEW-current/netbsd/src/sys/arch/i386/i386/machdep.c:748
#2  0xc02fac98 in panic (
    fmt=0xc05cc9c0 "kernel %sassertion \"%s\" failed: file \"%s\", line %d")
    at /usr/home/gdt/SINEW-current/netbsd/src/sys/kern/subr_prf.c:242
#3  0xc04b460c in __assert (t=0xc0550037 "diagnostic ", 
    f=0xc059e800 "/usr/home/gdt/SINEW-current/netbsd/src/sys/kern/kern_timeout.c", l=281, e=0xc055bcd5 "to_ticks >= 0")
    at /usr/home/gdt/SINEW-current/netbsd/src/sys/lib/libkern/__assert.c:47
#4  0xc02f3b15 in callout_schedule (c=0xc173973c, to_ticks=-350)
    at /usr/home/gdt/SINEW-current/netbsd/src/sys/kern/kern_timeout.c:301
#5  0xc0120741 in tcp_output (tp=0xc1739730)
    at /usr/home/gdt/SINEW-current/netbsd/src/sys/netinet/tcp_output.c:1119
#6  0xc012506f in tcp_disconnect (tp=0xc1739730)
    at /usr/home/gdt/SINEW-current/netbsd/src/sys/netinet/tcp_usrreq.c:860
#7  0xc0124937 in tcp_usrreq (so=0xc191b970, req=6, m=0x0, nam=0x0, 
    control=0x0, p=0x0)
    at /usr/home/gdt/SINEW-current/netbsd/src/sys/netinet/tcp_usrreq.c:457
#8  0xc031001b in sodisconnect (so=0xc191b970)
    at /usr/home/gdt/SINEW-current/netbsd/src/sys/kern/uipc_socket.c:695
#9  0xc030fda9 in soclose (so=0xc191b970)
    at /usr/home/gdt/SINEW-current/netbsd/src/sys/kern/uipc_socket.c:575
#10 0xc0301e8d in soo_close (fp=0xca8e3820, p=0xcbf7d00c)
    at /usr/home/gdt/SINEW-current/netbsd/src/sys/kern/sys_socket.c:220
#11 0xc02d6137 in closef (fp=0xca8e3820, p=0xcbf7d00c)
    at /usr/home/gdt/SINEW-current/netbsd/src/sys/kern/kern_descrip.c:1348
#12 0xc03860ee in syscall_plain (frame=0xcb815fa8)
    at /usr/home/gdt/SINEW-current/netbsd/src/sys/arch/i386/i386/syscall.c:156

print *tp
$9 = {t_family = 2, segq = {tqh_first = 0x0, tqh_last = 0xc1739734}, 
  t_timer = {{c_list = {cq_next = {elem = 0xc0604a90, list = 0xc0604a90}, 
        cq_prev = {elem = 0xc0604a90, list = 0xc0604a90}}, 
      c_func = 0xc0123fb0 <tcp_timer_rexmt>, c_arg = 0xc1739730, 
      c_time = 42032674, c_flags = 0}, {c_list = {cq_next = {
          elem = 0xc0605478, list = 0xc0605478}, cq_prev = {elem = 0xc1739af0, 
          list = 0xc1739af0}}, c_func = 0xc0124220 <tcp_timer_persist>, 
      c_arg = 0xc1739730, c_time = 42033067, c_flags = 0}, {c_list = {
        cq_next = {elem = 0xc0605550, list = 0xc0605550}, cq_prev = {
          elem = 0xc0605550, list = 0xc0605550}}, 
      c_func = 0xc0124314 <tcp_timer_keep>, c_arg = 0xc1739730, 
      c_time = 42752580, c_flags = 2}, {c_list = {cq_next = {elem = 0x0, 
          list = 0x0}, cq_prev = {elem = 0x0, list = 0x0}}, 
      c_func = 0xc012449c <tcp_timer_2msl>, c_arg = 0xc1739730, c_time = 0, 
      c_flags = 0}}, t_state = 8, t_rxtshift = 0, t_rxtcur = 4294967289, 
  t_dupacks = 0, t_peermss = 512, t_ourmss = 1460, t_segsz = 512, 
  t_force = 0 '\0', t_flags = 14500, t_template = 0xc1239f00, 
  t_inpcb = 0xc17a4f38, t_in6pcb = 0x0, t_delack_ch = {c_list = {cq_next = {
        elem = 0x0, list = 0x0}, cq_prev = {elem = 0x0, list = 0x0}}, 
    c_func = 0, c_arg = 0x0, c_time = 0, c_flags = 0}, snd_una = 3824733042, 
  snd_nxt = 3824733043, snd_up = 3824733042, snd_wl1 = 4244627348, 
  snd_wl2 = 3824733042, iss = 3824732535, snd_wnd = 6432, 
  snd_recover = 3824733042, rcv_wnd = 33580, rcv_nxt = 4244627349, 
  rcv_up = 4244626743, irs = 4244626742, rcv_adv = 4244660324, 
  snd_max = 3824733043, snd_cwnd = 2560, snd_ssthresh = 1073725440, 
  t_rcvtime = 840642, t_rtttime = 840642, t_rtseq = 3824733042, t_srtt = -288, 
  t_rttvar = -72, t_rttmin = 2, max_sndwnd = 6432, t_oobflags = 0 '\0', 
  t_iobc = 0 '\0', t_softerror = 0, snd_scale = 0 '\0', rcv_scale = 0 '\0', 
  request_r_scale = 0 '\0', requested_s_scale = 0 '\0', ts_recent = 0, 
  ts_recent_age = 0, ts_timebase = 840642, last_ack_sent = 4244627349, 
  timeq = {tqh_first = 0x0, tqh_last = 0xc173984c}, t_mtudisc = 1, t_sc = {
    lh_first = 0x0}, t_lastm = 0x0, t_inoff = 0, t_lastoff = -506, 
  t_lastlen = 506}

Things to note: snd_nxt is one more than snd_una.  Presumably a FIN
was just sent and we are trying to set a timer to wait for not getting
an ack.  But srtt is -288, rttvar is -72.  and rxtcur is unsigned, but
it looks like -7:

printf "%x\n", tp->t_rxtcur
fffffff9

t_flags is 38a4, which looks ok to me.

The 2msl timer is shown as invoking (bit 8 in callout flags), but the
backtrace doesn't show that.

So callout_schedule was passed -350, which is 50 times -7.

Any clues?  Is this a trashed tcp control block, or did a negative rtt
creep in somehow?  Should the RTT-computing logic check for
preposterous RTTs (takes unsigned, but as far as I can tell must
tcp_xmit_timer have got -9 as an argument.

Looking for ways to get a negative srtt, this call in tcp_input.c is
suspicious:

    if (opti.ts_present && opti.ts_ecr)
        tcp_xmit_timer(tp, TCP_TIMESTAMP(tp) - opti.ts_ecr + 1);

TCP_TIMESTAMP is tcp_now minus tcp_timebase.  In this case, they are
the same.  It is not checked that the remote side's timestamp echo is
between 0 and TCP_TIMESTAMP-1, so this could perhaps underflow.