socket/IPsec panic with 4.99.69

To: current-users%netbsd.org@localhost
Subject: socket/IPsec panic with 4.99.69
From: Greg Troxel <gdt%ir.bbn.com@localhost>
Date: Sun, 13 Jul 2008 08:33:52 -0400

With 4.99.69 (i386, 2 cpus), updated Jul 11 12:38 UTC, I got a crash
with the following backtrace:

(gdb) bt
#0  0xc0542247 in cpu_reboot (howto=260, bootstr=0x0) at ./x86/intr.h:142
#1  0xc0499ed8 in panic (fmt=0xc0a0d813 "trap")
    at /usr/ANONCVS/src/sys/kern/subr_prf.c:254
#2  0xc0544800 in trap (frame=0xcc852774)
    at /usr/ANONCVS/src/sys/arch/i386/i386/trap.c:352
#3  0xc010cb1f in calltrap ()
#4  0xc053ceb1 in db_read_bytes (addr=6, size=4, data=0xcc8527e4
    at /usr/ANONCVS/src/sys/arch/i386/i386/db_memrw.c:90
#5  0xc01d0877 in db_get_value (addr=6, size=4, is_signed=false)
    at /usr/ANONCVS/src/sys/ddb/db_access.c:62
#6  0xc053d74a in db_stack_trace_print (addr=-863688492, have_addr=true, 
    count=65535, modif=0xc0a3e372 "", pr=0xc0499cb0 <vprintf+144>)
    at /usr/ANONCVS/src/sys/arch/i386/i386/db_trace.c:485
#7  0xc0499ead in panic (fmt=0xc0a0d813 "trap")
    at /usr/ANONCVS/src/sys/kern/subr_prf.c:247
#8  0xc0544800 in trap (frame=0xcc852978)
    at /usr/ANONCVS/src/sys/arch/i386/i386/trap.c:352
#9  0xc010cb1f in calltrap ()
#10 0xc01ba654 in ipsec4_getpolicybysock (m=0xc3476b00, dir=1, so=0xc37160d0, 
    error=0xcc852b14) at /usr/ANONCVS/src/sys/netinet6/ipsec.c:436
#11 0xc01baab2 in ipsec4_in_reject_so (m=0xc3476b00, so=0xc37160d0)
    at /usr/ANONCVS/src/sys/netinet6/ipsec.c:1828
#12 0xc0157334 in tcp_input (m=0xc3476b00)
    at /usr/ANONCVS/src/sys/netinet/tcp_input.c:1231
#13 0xc014f93a in ip_input (m=0xc3476b00)
    at /usr/ANONCVS/src/sys/netinet/ip_input.c:1055
#14 0xc014ff0d in ipintr () at /usr/ANONCVS/src/sys/netinet/ip_input.c:473
#15 0xc047e6ec in softint_dispatch (pinned=0xcef65840, s=4)
    at /usr/ANONCVS/src/sys/kern/kern_softint.c:498
#16 0xc0100e0d in Xsoftintr ()
#17 0x00000000 in ?? ()

In frame 10, the code was dereferencing so->so_pcb to get inp_sp.

Looking at frame 10, print *so gives:

(gdb) print *so
$13 = {so_lock = 0xcc3e4f80, so_cv = {cv_opaque = {0x0, 0xc37160d4}, 
    cv_wmesg = 0xc0a3951e "socket"}, so_type = 1, so_options = 4, 
  so_linger = 0, so_state = 2097, so_nbio = 0, so_pcb = 0x0, 
  so_proto = 0xc07e4440, so_head = 0x0, so_onq = 0x0, so_q0 = {
    tqh_first = 0x0, tqh_last = 0xc37160fc}, so_q = {tqh_first = 0x0, 
    tqh_last = 0xc3716104}, so_qe = {tqe_next = 0x0, tqe_prev = 0xc34dc554}, 
  so_q0len = 0, so_qlen = 0, so_qlimit = 0, so_timeo = 0, so_error = 0, 
  so_aborting = 0, so_pgid = 0, so_oobmark = 0, so_snd = {sb_sel = {
      sel_klist = {slh_first = 0x0}, sel_cpu = 0x0, sel_lwp = 0x0, 
      sel_chain = {sle_next = 0x0}, sel_collision = 0, sel_reserved = {0, 0, 
        0}}, sb_mowner = 0x0, sb_so = 0xc37160d0, sb_cv = {cv_opaque = {0x0, 
        0xc3716150}, cv_wmesg = 0xc0a0608e "netio"}, sb_cc = 0, sb_hiwat = 0, 
    sb_mbcnt = 0, sb_mbmax = 0, sb_lowat = 2048, sb_mb = 0x0, sb_mbtail = 0x0, 
    sb_lastrecord = 0x0, sb_flags = 2048, sb_timeo = 0, sb_overflowed = 0}, 
  so_rcv = {sb_sel = {sel_klist = {slh_first = 0x0}, sel_cpu = 0xcc3d9680, 
      sel_lwp = 0x0, sel_chain = {sle_next = 0x0}, sel_collision = 0, 
      sel_reserved = {0, 0, 0}}, sb_mowner = 0x0, sb_so = 0xc37160d0, sb_cv = {
      cv_opaque = {0x0, 0xc37161b0}, cv_wmesg = 0xc0a0608e "netio"}, 
    sb_cc = 0, sb_hiwat = 0, sb_mbcnt = 0, sb_mbmax = 0, sb_lowat = 0, 
    sb_mb = 0x0, sb_mbtail = 0x0, sb_lastrecord = 0x0, sb_flags = 0, 
    sb_timeo = 0, sb_overflowed = 0}, so_internal = 0x0, so_upcall = 0, 
  so_upcallarg = 0x0, so_send = 0xc04bf540 <sobind+96>, 
  so_receive = 0xc04c0fb0 <fsocreate+240>, so_mowner = 0x0, 
  so_uidinfo = 0xcda32d54, so_egid = 100, so_cpid = 6901}

and the problem is either that the pcb is null or the so pointer has
been overwritten.  so_send and so_receive don't quite look right, but
the rest of things do.  In particular so->so_uidinfo points to a struct
with my uid in it.  So it doesn't look like the struct is generally
trashed.
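
One more sanity check on the dump is to split the flag words into their
individual bits, e.g. so_state = 2097.  The following is only a userland
sketch; set_bits and print_set_bits are made-up helper names, and the
meaning of each bit still has to be looked up in the SS_* definitions of
the tree in question, which I am not reproducing here.

```c
#include <stdint.h>
#include <stdio.h>

/* Store the set bits of value in bits[] (lowest first); return the count. */
static int
set_bits(uint32_t value, uint32_t bits[32])
{
	int n = 0;

	/* mask wraps to 0 after the top bit, which ends the loop. */
	for (uint32_t mask = 1; mask != 0; mask <<= 1) {
		if (value & mask)
			bits[n++] = mask;
	}
	return n;
}

/* Print value in decimal and hex, followed by each set bit in hex. */
static void
print_set_bits(uint32_t value)
{
	uint32_t bits[32];
	int n = set_bits(value, bits);

	printf("%u = 0x%x:", (unsigned)value, (unsigned)value);
	for (int i = 0; i < n; i++)
		printf(" 0x%x", (unsigned)bits[i]);
	printf("\n");
}
```

Calling print_set_bits(2097) prints "2097 = 0x831: 0x1 0x10 0x20 0x800".
The same helper works for so_options = 4 and sb_flags = 2048.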

I am inclined to add a check for so->so_pcb being null, with a printf
and error return if so, and see if that fires and the system
subsequently stays up.
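
Roughly, the check I have in mind looks like the sketch below.  This is a
userland mock, not the actual ipsec.c code: the structs are cut down to
the fields mentioned above, getpolicy_checked is a made-up name, and
EINVAL is an arbitrary choice of error value.

```c
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

/* Simplified stand-ins for the kernel structures (assumption: only the
 * fields named in this message are modelled). */
struct inpcbpolicy { int dummy; };
struct inpcb  { struct inpcbpolicy *inp_sp; };
struct socket { void *so_pcb; };

/*
 * Hypothetical helper: return the policy pointer hanging off the
 * socket's PCB, or NULL with *error set if no PCB is attached.
 */
static struct inpcbpolicy *
getpolicy_checked(const struct socket *so, int *error)
{
	if (so == NULL || so->so_pcb == NULL) {
		/* Log instead of dereferencing a null pcb. */
		printf("getpolicy_checked: so %p has no pcb\n",
		    (const void *)so);
		*error = EINVAL;
		return NULL;
	}
	*error = 0;
	return ((struct inpcb *)so->so_pcb)->inp_sp;
}
```

With a socket whose so_pcb is 0x0, as in the dump above, this logs a
message and hands EINVAL back to the caller instead of trapping.  It only
papers over the symptom; if the pcb is being detached underneath the
input path, the check itself would still be racy.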

I've been running current on this hardware for a long time, and it was
generally stable, and then around 4.99.59?? I had occasional crashes
(never figured out why) and then 4.99.63 was stable, and then 4.99.69 is
crashing again.  It could be that there is a socket/IPsec bug that's
been present a long time and the other ones just didn't happen to hit
it.  I suspect not enough synchronization surrounding socket/pcb
attachment and detachment, but I haven't looked yet.

Thanks to whoever did sparse core dumps - 256M for a 2G machine is much
easier on the disk, and more likely to fit in free space.

Is there a way to see what the other cpu was doing from the backtrace?

Follow-Ups:
- Re: socket/IPsec panic with 4.99.69
  - From: Michael van Elst
