socket/IPsec panic with 4.99.69
With 4.99.69 (i386, 2 cpus), updated Jul 11 12:38 UTC, I got a crash
with the following backtrace:
(gdb) bt
#0 0xc0542247 in cpu_reboot (howto=260, bootstr=0x0) at ./x86/intr.h:142
#1 0xc0499ed8 in panic (fmt=0xc0a0d813 "trap")
at /usr/ANONCVS/src/sys/kern/subr_prf.c:254
#2 0xc0544800 in trap (frame=0xcc852774)
at /usr/ANONCVS/src/sys/arch/i386/i386/trap.c:352
#3 0xc010cb1f in calltrap ()
#4 0xc053ceb1 in db_read_bytes (addr=6, size=4, data=0xcc8527e4)
at /usr/ANONCVS/src/sys/arch/i386/i386/db_memrw.c:90
#5 0xc01d0877 in db_get_value (addr=6, size=4, is_signed=false)
at /usr/ANONCVS/src/sys/ddb/db_access.c:62
#6 0xc053d74a in db_stack_trace_print (addr=-863688492, have_addr=true,
count=65535, modif=0xc0a3e372 "", pr=0xc0499cb0 <vprintf+144>)
at /usr/ANONCVS/src/sys/arch/i386/i386/db_trace.c:485
#7 0xc0499ead in panic (fmt=0xc0a0d813 "trap")
at /usr/ANONCVS/src/sys/kern/subr_prf.c:247
#8 0xc0544800 in trap (frame=0xcc852978)
at /usr/ANONCVS/src/sys/arch/i386/i386/trap.c:352
#9 0xc010cb1f in calltrap ()
#10 0xc01ba654 in ipsec4_getpolicybysock (m=0xc3476b00, dir=1, so=0xc37160d0,
error=0xcc852b14) at /usr/ANONCVS/src/sys/netinet6/ipsec.c:436
#11 0xc01baab2 in ipsec4_in_reject_so (m=0xc3476b00, so=0xc37160d0)
at /usr/ANONCVS/src/sys/netinet6/ipsec.c:1828
#12 0xc0157334 in tcp_input (m=0xc3476b00)
at /usr/ANONCVS/src/sys/netinet/tcp_input.c:1231
#13 0xc014f93a in ip_input (m=0xc3476b00)
at /usr/ANONCVS/src/sys/netinet/ip_input.c:1055
#14 0xc014ff0d in ipintr () at /usr/ANONCVS/src/sys/netinet/ip_input.c:473
#15 0xc047e6ec in softint_dispatch (pinned=0xcef65840, s=4)
at /usr/ANONCVS/src/sys/kern/kern_softint.c:498
#16 0xc0100e0d in Xsoftintr ()
#17 0x00000000 in ?? ()
In frame 10, the code was dereferencing so->so_pcb to get inp_sp.
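Roughly, the faulting access looks like the sketch below (not the exact
source; sotoinpcb() is just a cast of so->so_pcb, so with a NULL pcb the
inp_sp read lands at a small offset from address 0):

    /* sketch of the access in ipsec4_getpolicybysock() that faults */
    struct inpcb *inp = sotoinpcb(so);       /* ((struct inpcb *)(so)->so_pcb) */
    struct inpcbpolicy *pcbsp = inp->inp_sp; /* faults here if so_pcb is NULL */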
Looking at frame 10, print *so gives:
(gdb) print *so
$13 = {so_lock = 0xcc3e4f80, so_cv = {cv_opaque = {0x0, 0xc37160d4},
cv_wmesg = 0xc0a3951e "socket"}, so_type = 1, so_options = 4,
so_linger = 0, so_state = 2097, so_nbio = 0, so_pcb = 0x0,
so_proto = 0xc07e4440, so_head = 0x0, so_onq = 0x0, so_q0 = {
tqh_first = 0x0, tqh_last = 0xc37160fc}, so_q = {tqh_first = 0x0,
tqh_last = 0xc3716104}, so_qe = {tqe_next = 0x0, tqe_prev = 0xc34dc554},
so_q0len = 0, so_qlen = 0, so_qlimit = 0, so_timeo = 0, so_error = 0,
so_aborting = 0, so_pgid = 0, so_oobmark = 0, so_snd = {sb_sel = {
sel_klist = {slh_first = 0x0}, sel_cpu = 0x0, sel_lwp = 0x0,
sel_chain = {sle_next = 0x0}, sel_collision = 0, sel_reserved = {0, 0,
0}}, sb_mowner = 0x0, sb_so = 0xc37160d0, sb_cv = {cv_opaque = {0x0,
0xc3716150}, cv_wmesg = 0xc0a0608e "netio"}, sb_cc = 0, sb_hiwat = 0,
sb_mbcnt = 0, sb_mbmax = 0, sb_lowat = 2048, sb_mb = 0x0, sb_mbtail = 0x0,
sb_lastrecord = 0x0, sb_flags = 2048, sb_timeo = 0, sb_overflowed = 0},
so_rcv = {sb_sel = {sel_klist = {slh_first = 0x0}, sel_cpu = 0xcc3d9680,
sel_lwp = 0x0, sel_chain = {sle_next = 0x0}, sel_collision = 0,
sel_reserved = {0, 0, 0}}, sb_mowner = 0x0, sb_so = 0xc37160d0, sb_cv = {
cv_opaque = {0x0, 0xc37161b0}, cv_wmesg = 0xc0a0608e "netio"},
sb_cc = 0, sb_hiwat = 0, sb_mbcnt = 0, sb_mbmax = 0, sb_lowat = 0,
sb_mb = 0x0, sb_mbtail = 0x0, sb_lastrecord = 0x0, sb_flags = 0,
sb_timeo = 0, sb_overflowed = 0}, so_internal = 0x0, so_upcall = 0,
so_upcallarg = 0x0, so_send = 0xc04bf540 <sobind+96>,
so_receive = 0xc04c0fb0 <fsocreate+240>, so_mowner = 0x0,
so_uidinfo = 0xcda32d54, so_egid = 100, so_cpid = 6901}
and the problem is either that the pcb is null or that the so pointer has
been overwritten. so_send and so_receive don't quite look right, but the
rest does. In particular, so->so_uidinfo points to a struct with my uid
in it, so the struct doesn't look generally trashed.
I am inclined to add a check for so->so_pcb being NULL, with a printf
and an error return if it is, and see whether that fires and the system
subsequently stays up.
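Something like this at the top of ipsec4_getpolicybysock() is what I
have in mind (a sketch only; the message and errno are placeholders):

    if (so->so_pcb == NULL) {
        printf("ipsec4_getpolicybysock: so %p has NULL so_pcb\n", so);
        *error = EINVAL;
        return NULL;
    }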
I've been running current on this hardware for a long time, and it was
generally stable. Around 4.99.59?? I had occasional crashes (never
figured out why), then 4.99.63 was stable, and now 4.99.69 is crashing
again. It could be that there is a socket/IPsec bug that has been
present a long time and the earlier kernels just didn't happen to hit
it. I suspect there is not enough synchronization around socket/pcb
attachment and detachment, but I haven't looked yet.
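The interleaving I'm imagining is something like this (speculative;
just to illustrate the window):

    CPU0 (process context)          CPU1 (softint: ipintr/tcp_input)
    soclose() -> pcb detach:        tcp_input() finds the socket,
      so->so_pcb = NULL;            calls ipsec4_in_reject_so(m, so),
                                    which reads so->so_pcb -> NULL deref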
Thanks to whoever did sparse core dumps - a 256M dump for a 2G machine
is much easier on the disk, and more likely to fit in free space.
Is there a way to see what the other cpu was doing from the backtrace?