Subject: kern/29580: destruction of ppp cloners leads to crash
To: None <kern-bug-people@netbsd.org, gnats-admin@netbsd.org,>
From: None <gdt@ir.bbn.com>
List: netbsd-bugs
Date: 03/02/2005 18:56:00
>Number:         29580
>Category:       kern
>Synopsis:       destruction of ppp cloners leads to crash
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Wed Mar 02 18:56:00 +0000 2005
>Originator:     Greg Troxel
>Release:        NetBSD 2.99.15 from 20050202
>Organization:
        Greg Troxel <gdt@ir.bbn.com>
>Environment:
System: NetBSD foobar.ir.bbn.com 2.99.15 NetBSD 2.99.15 (SINEW) #5: Wed Feb 9 16:45:55 EST 2005 root@bazbam.ir.bbn.com:/n0/obj/sinew/gdt/i386/sys/arch/i386/compile/SINEW i386
Architecture: i386
Machine: i386
>Description:

Two systems run pppd(8), and once a day one end brings down the link
and reestablishes it (to work around a buggy PBX that drops long calls
at inconvenient times).  The system runs quagga, with ospfd and
ripngd.  The system has frequently panicked around the link down/up
time.

One crash occurred when apparently trying to leave a group that had
been joined on the ppp interface.  The ifp had if_softc and others set
to 0xdeadbeef, indicating it had been freed, and on a similar crash
the other end had pppioctl as if_ioctl.  But, the struct in_multi
still had an ifp reference.

Other crashes have occurred in in6_selecthlim.  I didn't manage to get
dumps from those, but I suspect that ifp->if_afdata[AF_INET6] has been
freed and therefore this code:

	else if (ifp)
		return (ND_IFINFO(ifp)->chlim);

dereferences a NULL pointer.  The DDB backtrace looked like the ripng
transmit path (via udp6_output - ripng is straightforward multicast
udp over v6).
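
To make the suspected failure concrete, here is a minimal userland
sketch of the lookup that ND_IFINFO(ifp)->chlim performs.  Every model_*
and MODEL_* name is hypothetical, and the slot index and default hop
limit are arbitrary; only the shape (ifp -> if_afdata slot -> ND info ->
chlim) follows the code quoted above.  The NULL checks stand in for the
checks the unguarded path lacks.  They are not a fix: an ifp that has
already been freed cannot be tested safely at all.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_AF_MAX       33  /* 33 if_afdata slots, as in the gdb dump */
#define MODEL_AF_INET6     24  /* hypothetical slot index for the v6 data */
#define MODEL_DEFAULT_HLIM 64  /* hypothetical fallback hop limit */

struct model_nd_ifinfo { unsigned char chlim; };
struct model_in6_ifextra { struct model_nd_ifinfo *nd_ifinfo; };
struct model_ifnet { void *if_afdata[MODEL_AF_MAX]; };

/*
 * Return the per-interface hop limit, or the fallback when the
 * interface or its per-family data is missing.
 */
int
model_selecthlim(const struct model_ifnet *ifp)
{
	const struct model_in6_ifextra *ext;

	if (ifp == NULL)
		return MODEL_DEFAULT_HLIM;
	ext = ifp->if_afdata[MODEL_AF_INET6];
	/* Without this check, a zeroed slot is dereferenced here. */
	if (ext == NULL || ext->nd_ifinfo == NULL)
		return MODEL_DEFAULT_HLIM;
	return ext->nd_ifinfo->chlim;
}
```

Calling model_selecthlim() on a zero-filled struct model_ifnet, which
is what the if_afdata array in the dumps looks like, returns the
fallback instead of faulting.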

The machine that receives the ppp calls also crashes, due to the stray
ifp in multicast memberships.

Long ago, struct ifnets were created and never destroyed.  With the
cloning changes, they are more aggressively freed.  I note that there
is code in sys/net/if.c:if_detach to prune a lot of state that might
reference a struct ifnet that is about to be destroyed, but it seems
that some references remain.

In reading sys/net/if.c:if_detach, observe that the code only calls
PRU_PURGEIF for protocols belonging to address families that are on
the interface's address list.  This assumes that multicast membership
records in PF x pointing to the ifp can only exist if the ifp has an
address of PF x.  But, if pppd removes the IP address from the
interface as part of a clean shutdown after an LCP TermReq (which it
should do, and used to do without deleting ppp0 under 1.6.2 and 2.0),
there won't be addresses, but still could be joined groups.
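
The gap can be illustrated with a small userland model.  This is a
sketch under stated assumptions, not the kernel code: all names are
hypothetical, and memberships are reduced to a (family, ifp) pair.  The
model purges memberships only for families that still have an address
on the interface, as described above, and reports how many memberships
are left pointing at the interface that is about to be freed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { FAM_INET, FAM_INET6, FAM_COUNT };

struct model_ifnet {
	/* Is an address of this family still on the address list? */
	bool has_addr[FAM_COUNT];
};

struct model_membership {
	int family;
	struct model_ifnet *ifp;	/* set to NULL once purged */
};

/*
 * Purge memberships on ifp, but only for families that still have an
 * address configured.  Returns the number of memberships that still
 * reference ifp afterwards, i.e. the stale pointers.
 */
size_t
model_detach_by_address(struct model_ifnet *ifp,
    struct model_membership *m, size_t n)
{
	size_t i, stale = 0;

	for (i = 0; i < n; i++) {
		if (m[i].ifp != ifp)
			continue;
		assert(m[i].family >= 0 && m[i].family < FAM_COUNT);
		if (ifp->has_addr[m[i].family])
			m[i].ifp = NULL;	/* family was purged */
		else
			stale++;		/* family was skipped */
	}
	return stale;
}
```

With the IPv4 address already removed by pppd and one IPv4 group still
joined, the function returns 1: that membership survives the detach and
later reaches igmp_leavegroup with a dangling ifp.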

pppd removes the v6 address pair with the peer's LL address at IPv6CP
close time.  I can't find anything that removes the regular link-local
address.  From reading udp6_output, and what I remember of the
ddb-no-crash-file crash in in6_selecthlim, I think a cached route in a
pcb must have been pointing to a destroyed struct ifnet.

Note in udp6_output that the pktinfo is parsed and checked, and later
used.  This requires being at an elevated priority sufficient to
prevent if_detach from running, and I don't see how this is
guaranteed.  That said, this crash is too frequent for me to believe
this is the problem.


Backtrace:

#4  0xc0102de1 in calltrap ()
#5  0xc010f6db in igmp_sendpkt (inm=0xc0955080, type=23)
    at /n0/gdt/SINEW-current/netbsd/src/sys/netinet/igmp.c:575
#6  0xc010f3b6 in igmp_leavegroup (inm=0xc0955080)
    at /n0/gdt/SINEW-current/netbsd/src/sys/netinet/igmp.c:454
#7  0xc0110ea1 in in_delmulti (inm=0xc0955080)
    at /n0/gdt/SINEW-current/netbsd/src/sys/netinet/in.c:1142
#8  0xc011ad90 in ip_freemoptions (imo=0xc0b29680)
    at /n0/gdt/SINEW-current/netbsd/src/sys/netinet/ip_output.c:1829
#9  0xc0111758 in in_pcbdetach (v=0xc0b4c514)
    at /n0/gdt/SINEW-current/netbsd/src/sys/netinet/in_pcb.c:501
#10 0xc0128379 in udp_usrreq (so=0xc0b53528, req=1, m=0x0, nam=0x0, 
    control=0x0, p=0x0)
    at /n0/gdt/SINEW-current/netbsd/src/sys/netinet/udp_usrreq.c:1059
#11 0xc0332bb6 in soclose (so=0xc0b53528)
    at /n0/gdt/SINEW-current/netbsd/src/sys/kern/uipc_socket.c:604
#12 0xc0324029 in soo_close (fp=0xc8e8a008, p=0xc8a5bb30)
    at /n0/gdt/SINEW-current/netbsd/src/sys/kern/sys_socket.c:238
#13 0xc02f6d0f in closef (fp=0xc8e8a008, p=0xc8a5bb30)
    at /n0/gdt/SINEW-current/netbsd/src/sys/kern/kern_descrip.c:1424
#14 0xc02f6b0b in fdfree (p=0xc8a5bb30)
    at /n0/gdt/SINEW-current/netbsd/src/sys/kern/kern_descrip.c:1290
#15 0xc02fac2d in exit1 (l=0xc80ebce4, rv=15)
    at /n0/gdt/SINEW-current/netbsd/src/sys/kern/kern_exit.c:267
#16 0xc030b234 in postsig (signum=15)
    at /n0/gdt/SINEW-current/netbsd/src/sys/kern/kern_sig.c:1852
#17 0xc03b0a2c in syscall_plain (frame=0xc8b78fa8)
    at /n0/gdt/SINEW-current/netbsd/src/sys/sys/userret.h:93
(gdb) 

(gdb) fr 5
#5  0xc010f6db in igmp_sendpkt (inm=0xc0955080, type=23)
    at /n0/gdt/SINEW-current/netbsd/src/sys/netinet/igmp.c:575
(gdb) print *inm->inm_ifp
$5 = {if_softc = 0xdeadbeef, if_list = {tqe_next = 0xc062a820, 
    tqe_prev = 0xc0bfe000}, if_addrlist = {tqh_first = 0xdeadbeef, 
    tqh_last = 0xdeadbeef}, if_xname = "\357\276\255\336\357\276\255\336\357\276\255\336\0\0\0", if_pcount = 0, 
  if_bpf = 0x0, if_index = 0, if_timer = 0, if_flags = 0, if_extflags = 0, 
  if_data = {ifi_type = 0 '\0', ifi_addrlen = 0 '\0', ifi_hdrlen = 0 '\0', 
    ifi_link_state = 0, ifi_mtu = 0, ifi_metric = 0, ifi_baudrate = 0, 
    ifi_ipackets = 0, ifi_ierrors = 0, ifi_opackets = 0, ifi_oerrors = 0, 
    ifi_collisions = 0, ifi_ibytes = 0, ifi_obytes = 0, ifi_imcasts = 0, 
    ifi_omcasts = 0, ifi_iqdrops = 0, ifi_noproto = 0, ifi_lastchange = {
      tv_sec = 0, tv_usec = 0}}, if_output = 0, if_input = 0, if_start = 0, 
  if_ioctl = 0, if_init = 0, if_stop = 0, if_watchdog = 0, if_drain = 0, 
  if_snd = {ifq_head = 0x0, ifq_tail = 0x0, ifq_len = 0, ifq_maxlen = 0, 
    ifq_drops = 0, altq_type = 0, altq_flags = 0, altq_disc = 0x0, 
    altq_ifp = 0x0, altq_enqueue = 0, altq_dequeue = 0, altq_request = 0, 
    altq_clfier = 0x0, altq_classify = 0, altq_tbr = 0x0, altq_cdnr = 0x0}, 
  if_sadl = 0x0, if_broadcastaddr = 0x0, if_bridge = 0x0, if_dlt = 0, 
  if_pfil = {ph_in = {tqh_first = 0x0, tqh_last = 0x0}, ph_out = {
      tqh_first = 0x0, tqh_last = 0x0}, ph_ifaddr = {tqh_first = 0x0, 
      tqh_last = 0x0}, ph_ifnetevent = {tqh_first = 0x0, tqh_last = 0x0}, 
    ph_type = 0, ph_un = {phu_val = 0, phu_ptr = 0x0}, ph_list = {
      le_next = 0x0, le_prev = 0x0}}, if_capabilities = 0, if_capenable = 0, 
  if_csum_flags_tx = 0, if_csum_flags_rx = 0, if_afdata = {
    0x0 <repeats 33 times>}, if_mowner = 0x0}

Another crash:

#0  0x1fd07000 in ?? ()
#1  0xc03a8923 in cpu_reboot (howto=260, bootstr=0x0)
    at /usr/home/gdt/SINEW-current/netbsd/src/sys/arch/i386/i386/machdep.c:754
#2  0xc031cb48 in panic (fmt=0xc05a935f "trap")
    at /usr/home/gdt/SINEW-current/netbsd/src/sys/kern/subr_prf.c:242
#3  0xc03b10e5 in trap (frame=0xcca8f890)
    at /usr/home/gdt/SINEW-current/netbsd/src/sys/arch/i386/i386/trap.c:296
#4  0xc0102de1 in calltrap ()
#5  0xc0168585 in ipsec4_in_reject_so (m=0xc184ed00, so=0xc148adb8)
    at /usr/home/gdt/SINEW-current/netbsd/src/sys/netinet6/ipsec.c:1825
#6  0xc011aef3 in rip_input (m=0xc184ed00)
    at /usr/home/gdt/SINEW-current/netbsd/src/sys/netinet/raw_ip.c:208
#7  0xc0113e53 in ip_input (m=0xc184ed00)
    at /usr/home/gdt/SINEW-current/netbsd/src/sys/netinet/ip_input.c:1028
#8  0xc0113866 in ipintr ()
    at /usr/home/gdt/SINEW-current/netbsd/src/sys/netinet/ip_input.c:467
#9  0xc0102aa1 in Xsoftnet ()
#10 0xc03a3e61 in softintr_dispatch (which=0) at x86/intr.h:160
#11 0xc0102af6 in Xsoftclock ()
#12 0xc0341456 in vfs_shutdown () at x86/intr.h:160
#13 0xc03a8937 in cpu_reboot (howto=256, bootstr=0x0)
    at /usr/home/gdt/SINEW-current/netbsd/src/sys/arch/i386/i386/machdep.c:740
#14 0xc031cb48 in panic (fmt=0xc05a935f "trap")
    at /usr/home/gdt/SINEW-current/netbsd/src/sys/kern/subr_prf.c:242
#15 0xc03b10e5 in trap (frame=0xcca8fc40)
    at /usr/home/gdt/SINEW-current/netbsd/src/sys/arch/i386/i386/trap.c:296
#16 0xc0102de1 in calltrap ()
#17 0xc011ad90 in ip_freemoptions (imo=0xc1421500)
    at /usr/home/gdt/SINEW-current/netbsd/src/sys/netinet/ip_output.c:1829
#18 0xc0111758 in in_pcbdetach (v=0xc13ed4a8)
    at /usr/home/gdt/SINEW-current/netbsd/src/sys/netinet/in_pcb.c:501
#19 0xc011b77d in rip_usrreq (so=0xc148adb8, req=1, m=0x0, nam=0x0, 
    control=0x0, p=0x0)
    at /usr/home/gdt/SINEW-current/netbsd/src/sys/netinet/raw_ip.c:579
#20 0xc0332bb6 in soclose (so=0xc148adb8)
    at /usr/home/gdt/SINEW-current/netbsd/src/sys/kern/uipc_socket.c:604
#21 0xc0324029 in soo_close (fp=0xccaf319c, p=0xccdaf00c)
    at /usr/home/gdt/SINEW-current/netbsd/src/sys/kern/sys_socket.c:238
#22 0xc02f6d0f in closef (fp=0xccaf319c, p=0xccdaf00c)
    at /usr/home/gdt/SINEW-current/netbsd/src/sys/kern/kern_descrip.c:1424
#23 0xc02f6b0b in fdfree (p=0xccdaf00c)
    at /usr/home/gdt/SINEW-current/netbsd/src/sys/kern/kern_descrip.c:1290
#24 0xc02fac2d in exit1 (l=0xccaa1320, rv=11)
    at /usr/home/gdt/SINEW-current/netbsd/src/sys/kern/kern_exit.c:267
#25 0xc030b234 in postsig (signum=11)
    at /usr/home/gdt/SINEW-current/netbsd/src/sys/kern/kern_sig.c:1852
#26 0xc03b1330 in trap (frame=0xcca8ffa8)
    at /usr/home/gdt/SINEW-current/netbsd/src/sys/sys/userret.h:93

print * inp->inp_moptions->imo_membership[2].inm_ifp
$30 = {if_softc = 0xdeadbeef, if_list = {tqe_next = 0xc06236a0, 
    tqe_prev = 0xc1a10400}, if_addrlist = {tqh_first = 0xdeadbeef, 
    tqh_last = 0xdeadbeef}, if_xname = "\357\276\255\336\357\276\255\336\357\276\255\336\0\0\0", if_pcount = 0, 
  if_bpf = 0x0, if_index = 0, if_timer = 0, if_flags = 0, if_extflags = 0, 
  if_data = {ifi_type = 0 '\0', ifi_addrlen = 0 '\0', ifi_hdrlen = 0 '\0', 
    ifi_link_state = 0, ifi_mtu = 0, ifi_metric = 0, ifi_baudrate = 0, 
    ifi_ipackets = 0, ifi_ierrors = 0, ifi_opackets = 0, ifi_oerrors = 0, 
    ifi_collisions = 0, ifi_ibytes = 0, ifi_obytes = 0, ifi_imcasts = 0, 
    ifi_omcasts = 0, ifi_iqdrops = 0, ifi_noproto = 0, ifi_lastchange = {
      tv_sec = 0, tv_usec = 0}}, if_output = 0, if_input = 0, if_start = 0, 
  if_ioctl = 0, if_init = 0, if_stop = 0, if_watchdog = 0, if_drain = 0, 
  if_snd = {ifq_head = 0x0, ifq_tail = 0x0, ifq_len = 0, ifq_maxlen = 0, 
    ifq_drops = 0, altq_type = 0, altq_flags = 0, altq_disc = 0x0, 
    altq_ifp = 0x0, altq_enqueue = 0, altq_dequeue = 0, altq_request = 0, 
    altq_clfier = 0x0, altq_classify = 0, altq_tbr = 0x0, altq_cdnr = 0x0}, 
  if_sadl = 0x0, if_broadcastaddr = 0x0, if_bridge = 0x0, if_dlt = 0, 
  if_pfil = {ph_in = {tqh_first = 0x0, tqh_last = 0x0}, ph_out = {
      tqh_first = 0x0, tqh_last = 0x0}, ph_ifaddr = {tqh_first = 0x0, 
      tqh_last = 0x0}, ph_ifnetevent = {tqh_first = 0x0, tqh_last = 0x0}, 
    ph_type = 0, ph_un = {phu_val = 0, phu_ptr = 0x0}, ph_list = {
      le_next = 0x0, le_prev = 0x0}}, if_capabilities = 0, if_capenable = 0, 
  if_csum_flags_tx = 0, if_csum_flags_rx = 0, if_afdata = {
    0x0 <repeats 33 times>}, if_mowner = 0x0}
(gdb) 

>How-To-Repeat:

Set up ppp between two systems.  Let the answering system
have a getty on the line, and use a PPP user login account that runs
pppd as the shell.  I use modems, but I don't think that's necessary.
Run quagga's ospfd and ripngd on both ends.  Configure ppp to redial
by giving it persist on one end, and put holdoff 60 in the options
file so that it redials after a minute.  Then, set up a cron job to do
"pkill -HUP pppd" every 10 minutes.  I would expect a crash well
within an hour.  The igmp crash has been 75% of disconnects on the
called system, and perhaps 25% on the calling.  The in6_selecthlim
crash is rarer, perhaps 10% of disconnects on the calling system.

>Fix:

In sys/net/if.c:if_detach, invoke the PURGEIF control method for all
protocols in all families, rather than only invoking it for all
protocols within families for which an address is configured.
While one could try to preserve the invariant that every ifp held by a
multicast membership or cached route points to a valid struct ifnet
without this change, doing so would involve per-ifp purges at the time
the last address of an AF is deleted, and this is too complicated to
be maintainable.
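
The loop proposed above can be sketched as follows.  This is a userland
model, not a patch: the model_* types only mimic a list of domains with
per-protocol switch tables, and pr_purgeif is a hypothetical hook
standing in for the PURGEIF request.  The point it shows is that the
walk covers every protocol of every family and never consults the
interface's address list.

```c
#include <assert.h>
#include <stddef.h>

struct model_ifnet { int unused; };

struct model_protosw {
	/* Hypothetical per-protocol purge hook; may be NULL. */
	void (*pr_purgeif)(struct model_ifnet *);
};

struct model_domain {
	struct model_protosw *dom_protosw;	/* protocol table */
	size_t dom_nproto;			/* entries in the table */
	struct model_domain *dom_next;		/* next family */
};

int model_purge_count;	/* incremented by the example hook below */

/* Example hook: a real one would drop memberships and cached routes. */
void
model_count_purge(struct model_ifnet *ifp)
{
	(void)ifp;
	model_purge_count++;
}

/*
 * Call the purge hook of every protocol in every family, whether or
 * not the interface still has an address of that family.  Returns the
 * number of hooks invoked.
 */
size_t
model_purge_all(struct model_domain *domains, struct model_ifnet *ifp)
{
	struct model_domain *dp;
	size_t i, calls = 0;

	assert(ifp != NULL);
	for (dp = domains; dp != NULL; dp = dp->dom_next) {
		for (i = 0; i < dp->dom_nproto; i++) {
			if (dp->dom_protosw[i].pr_purgeif == NULL)
				continue;
			dp->dom_protosw[i].pr_purgeif(ifp);
			calls++;
		}
	}
	return calls;
}
```

The cost is a few extra hook calls per detach for families that hold no
state for the interface, which seems acceptable for an operation as
rare as destroying an interface.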

It is not clear if the above will fix the in6_selecthlim crash, but I
think there's a good chance it will.