tech-net archive

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]

refactoring ip_output() and the L2 _output()



I've been trying to get a handle on a few of the many MP-safety problems
in the network stack.  My investigations brought me yesterday to
ether_output().

ether_output() is surprisingly complicated.  The ether_output() that one
would hope for takes an ifnet, a packet, and a nexthop as arguments,
adds the correct L2 header to the front of the packet, and calls into a
driver's transmit routine.  The ether_output() that we have does some
acrobatics with routes, ARP resolution, cons'ing up and prepending an L2
header, queueing the packet and nudging the driver.

Here are the route acrobatics with some annotations:

        if ((rt = rt0) != NULL) {
                if ((rt->rt_flags & RTF_UP) == 0) {

It's funny that we get into ether_output() with an rtentry that's
not even usable.  I'm not sure how that happens.  I will have to
look more carefully at what ip_output() is doing.

                        if ((rt0 = rt = rtalloc1(dst, 1)) != NULL) {
                                rt->rt_refcnt--;
                                if (rt->rt_ifp != ifp)
                                        return (*rt->rt_ifp->if_output)
                                                        (ifp, m0, dst, rt);
                        } else 
                                senderr(EHOSTUNREACH);
                }

Yikes!  To call rt->rt_ifp's output routine with a different ifp is not
advisable, is it?  ISTR finding a bug in carp(4) involving a similar
pattern, ifpX->if_output(ifqY, ...).

It feels wrong that a packet should ever make a hairpin turn in an
L2 output routine, but there it is: first it looks like it's going
out ifp, then it goes out rt->rt_ifp.

The code that follows checks for a gateway route.  Finding one, it looks
for a host route to the gateway.  It seems to me that ether_output() is
a peculiar place to do that: ip_output() should have taken care of that
for us.  Also, it seems to me that the (potentially) two-step lookup,

        1) lookup(destination) -> gateway_rt,
        2) lookup(gateway_rt->destination)

ought to be a one-step lookup, lookup(destination) -> nexthop.

                if ((rt->rt_flags & RTF_GATEWAY) && dst->sa_family != AF_NS) {

I wonder what the Xerox NS (AF_NS) exemption is about? Didn't NetBSD
have a sys/netns/ that was deleted?  Surely the check isn't needed
any more?

                        if (rt->rt_gwroute == NULL)
                                goto lookup;
                        if (((rt = rt->rt_gwroute)->rt_flags & RTF_UP) == 0) {
                                rtfree(rt); rt = rt0;
                        lookup: rt->rt_gwroute = rtalloc1(rt->rt_gateway, 1);
                                if ((rt = rt->rt_gwroute) == NULL)
                                        senderr(EHOSTUNREACH);
                                /* the "G" test below also prevents rt == rt0 */
                                if ((rt->rt_flags & RTF_GATEWAY) ||
                                    (rt->rt_ifp != ifp)) {
                                        rt->rt_refcnt--;
                                        rt0->rt_gwroute = NULL;
                                        senderr(EHOSTUNREACH);
                                }
                        }
                }
                if (rt->rt_flags & RTF_REJECT)
                        if (rt->rt_rmx.rmx_expire == 0 ||
                            (u_long) time_second < rt->rt_rmx.rmx_expire)
                                senderr(rt == rt0 ? EHOSTDOWN : EHOSTUNREACH);
        }

The route acrobatics alone are replicated in nd6_output(). nd6_output()
will call if_output == ether_output, so similar acrobatics are
performed twice.  Of course, the latter performance is cut short.

Here is the ARP resolution:

        case AF_INET:
                if (m->m_flags & M_BCAST)
                        (void)memcpy(edst, etherbroadcastaddr, sizeof(edst));
                else if (m->m_flags & M_MCAST)
                        ETHER_MAP_IP_MULTICAST(&satocsin(dst)->sin_addr, edst);
                else if (!arpresolve(ifp, rt, m, dst, edst))
                        return (0);     /* if not yet resolved */
                /* If broadcasting on a simplex interface, loopback a copy */
                if ((m->m_flags & M_BCAST) && (ifp->if_flags & IFF_SIMPLEX))
                        mcopy = m_copy(m, 0, (int)M_COPYALL);
                etype = htons(ETHERTYPE_IP);
                break;

BTW, it makes me grimace to see the simplex-loopback handled in an L2
transmit routine.  Here's the rest of that,

        if (mcopy)
                (void)looutput(ifp, mcopy, dst, rt);

Shouldn't that be looutput(lo0ifp, ...) or lo0ifp->if_output(lo0ifp,
...) ? It seems that the network layer should handle loopback itself by
copying the packet and ip_input()'ing it directly.

Both the route acrobatics and the ARP resolution are replicated in
arc_output(), ieee1394_output(), and fddi_output().

As a first attempt at cleaning this up, I've extracted the route
acrobatics from ether_output and moved them to a routine called
ip_hresolv_output that wraps the ifp->if_output call in ip_output.  See
attached patch.  I guess that this if any protocols other than IPv4 and
IPv6 rely on the acrobatics, this change will break them.

I'd like ether_output never, ever to deal with the routing table,
but arpresolve() deals some in routes, so eventually I will have
to factor that out.  I'm aiming for a streamlined IP transmission
path.  Today we have something comparatively baroque,

        # enter ip_output
        dst <- destination(packet)
        nexthop <- lookup(ip routing table, dst)
        # enter ether_output
        if (nexthop is gateway)
                if (gateway is cached)
                        nexthop <- gwcache(nexthop)
                else
                        gwcache(nexthop) <- lookup(ip routing table,
                                                   gwaddr(nexthop))
                        nexthop <- gwcache(nexthop)
        if (nexthop l2 info incomplete)
                defer_until_arp_complete(nexthop, packet)
        else
                l2hdr <- create_l2header(interface(nexthop), gwaddr(nexthop))
                transmit(interface(nexthop), l2hdr | packet)

I'm aiming for something more like this:

        # enter ip_output
        dst <- destination(packet)
        nexthop <- lookup(ip routing table, dst)
        if (nexthop l2 info incomplete)
                defer_until_arp_complete(nexthop, packet)
        else
                # enter ether_output
                transmit(interface(nexthop), l2header(nexthop), packet)

It will take several steps to get there.  Here is one.  A step in the
wrong direction?  Your feedback, please. :-)

Dave

-- 
David Young
dyoung%pobox.com@localhost    Urbana, IL    (217) 721-9981
--- sys/net/if_arcsubr.c        2013-02-01 16:15:44.951210808 -0500
+++ sys/net/if_arcsubr.c        2013-02-01 13:13:28.000000000 -0500
@@ -134,28 +134,7 @@ arc_output(struct ifnet *ifp, struct mbu
 
        myself = *CLLADDR(ifp->if_sadl);
 
-       if ((rt = rt0)) {
-               if ((rt->rt_flags & RTF_UP) == 0) {
-                       if ((rt0 = rt = rtalloc1(dst, 1)))
-                               rt->rt_refcnt--;
-                       else
-                               senderr(EHOSTUNREACH);
-               }
-               if (rt->rt_flags & RTF_GATEWAY) {
-                       if (rt->rt_gwroute == 0)
-                               goto lookup;
-                       if (((rt = rt->rt_gwroute)->rt_flags & RTF_UP) == 0) {
-                               rtfree(rt); rt = rt0;
-                       lookup: rt->rt_gwroute = rtalloc1(rt->rt_gateway, 1);
-                               if ((rt = rt->rt_gwroute) == 0)
-                                       senderr(EHOSTUNREACH);
-                       }
-               }
-               if (rt->rt_flags & RTF_REJECT)
-                       if (rt->rt_rmx.rmx_expire == 0 ||
-                           time_second < rt->rt_rmx.rmx_expire)
-                               senderr(rt == rt0 ? EHOSTDOWN : EHOSTUNREACH);
-       }
+       rt = rt0;
 
        /*
         * if the queueing discipline needs packet classification,
--- sys/net/if_ethersubr.c      2013-02-01 16:15:44.963210857 -0500
+++ sys/net/if_ethersubr.c      2013-02-01 15:32:05.000000000 -0500
@@ -279,38 +247,7 @@ ether_output(struct ifnet * const ifp0, 
 
        if ((ifp->if_flags & (IFF_UP|IFF_RUNNING)) != (IFF_UP|IFF_RUNNING))
                senderr(ENETDOWN);
-       if ((rt = rt0) != NULL) {
-               if ((rt->rt_flags & RTF_UP) == 0) {
-                       if ((rt0 = rt = rtalloc1(dst, 1)) != NULL) {
-                               rt->rt_refcnt--;
-                               if (rt->rt_ifp != ifp)
-                                       return (*rt->rt_ifp->if_output)
-                                                       (ifp, m0, dst, rt);
-                       } else
-                               senderr(EHOSTUNREACH);
-               }
-               if ((rt->rt_flags & RTF_GATEWAY) && dst->sa_family != AF_NS) {
-                       if (rt->rt_gwroute == NULL)
-                               goto lookup;
-                       if (((rt = rt->rt_gwroute)->rt_flags & RTF_UP) == 0) {
-                               rtfree(rt); rt = rt0;
-                       lookup: rt->rt_gwroute = rtalloc1(rt->rt_gateway, 1);
-                               if ((rt = rt->rt_gwroute) == NULL)
-                                       senderr(EHOSTUNREACH);
-                               /* the "G" test below also prevents rt == rt0 */
-                               if ((rt->rt_flags & RTF_GATEWAY) ||
-                                   (rt->rt_ifp != ifp)) {
-                                       rt->rt_refcnt--;
-                                       rt0->rt_gwroute = NULL;
-                                       senderr(EHOSTUNREACH);
-                               }
-                       }
-               }
-               if (rt->rt_flags & RTF_REJECT)
-                       if (rt->rt_rmx.rmx_expire == 0 ||
-                           (u_long) time_second < rt->rt_rmx.rmx_expire)
-                               senderr(rt == rt0 ? EHOSTDOWN : EHOSTUNREACH);
-       }
+       rt = rt0;
 
        switch (dst->sa_family) {
 
--- sys/net/if_fddisubr.c       2013-02-01 16:15:44.972210894 -0500
+++ sys/net/if_fddisubr.c       2013-02-01 13:12:56.000000000 -0500
@@ -238,30 +238,7 @@ fddi_output(struct ifnet *ifp0, struct m
 #endif /* NCARP > 0 */
        if ((ifp->if_flags & (IFF_UP|IFF_RUNNING)) != (IFF_UP|IFF_RUNNING))
                senderr(ENETDOWN);
-#if !defined(__bsdi__) || _BSDI_VERSION >= 199401
-       if ((rt = rt0) != NULL) {
-               if ((rt->rt_flags & RTF_UP) == 0) {
-                       if ((rt0 = rt = rtalloc1(dst, 1)) != NULL)
-                               rt->rt_refcnt--;
-                       else
-                               senderr(EHOSTUNREACH);
-               }
-               if (rt->rt_flags & RTF_GATEWAY) {
-                       if (rt->rt_gwroute == 0)
-                               goto lookup;
-                       if (((rt = rt->rt_gwroute)->rt_flags & RTF_UP) == 0) {
-                               rtfree(rt); rt = rt0;
-                       lookup: rt->rt_gwroute = rtalloc1(rt->rt_gateway, 1);
-                               if ((rt = rt->rt_gwroute) == 0)
-                                       senderr(EHOSTUNREACH);
-                       }
-               }
-               if (rt->rt_flags & RTF_REJECT)
-                       if (rt->rt_rmx.rmx_expire == 0 ||
-                           time_second < rt->rt_rmx.rmx_expire)
-                               senderr(rt == rt0 ? EHOSTDOWN : EHOSTUNREACH);
-       }
-#endif
+       rt = rt0;
 
        /*
         * If the queueing discipline needs packet classification,
--- sys/net/if_ieee1394subr.c   2013-02-01 16:15:44.981210932 -0500
+++ sys/net/if_ieee1394subr.c   2013-02-01 13:15:12.000000000 -0500
@@ -100,40 +100,7 @@ ieee1394_output(struct ifnet *ifp, struc
 
        if ((ifp->if_flags & (IFF_UP|IFF_RUNNING)) != (IFF_UP|IFF_RUNNING))
                senderr(ENETDOWN);
-       if ((rt = rt0) != NULL) {
-               if ((rt->rt_flags & RTF_UP) == 0) {
-                       if ((rt0 = rt = rtalloc1(dst, 1)) != NULL) {
-                               rt->rt_refcnt--;
-                               if (rt->rt_ifp != ifp)
-                                       return (*rt->rt_ifp->if_output)
-                                                       (ifp, m0, dst, rt);
-                       } else
-                               senderr(EHOSTUNREACH);
-               }
-               if (rt->rt_flags & RTF_GATEWAY) {
-                       if (rt->rt_gwroute == NULL)
-                               goto lookup;
-                       if (((rt = rt->rt_gwroute)->rt_flags & RTF_UP) == 0) {
-                               rtfree(rt);
-                               rt = rt0;
-  lookup:
-                               rt->rt_gwroute = rtalloc1(rt->rt_gateway, 1);
-                               if ((rt = rt->rt_gwroute) == NULL)
-                                       senderr(EHOSTUNREACH);
-                               /* the "G" test below also prevents rt == rt0 */
-                               if ((rt->rt_flags & RTF_GATEWAY) ||
-                                   (rt->rt_ifp != ifp)) {
-                                       rt->rt_refcnt--;
-                                       rt0->rt_gwroute = NULL;
-                                       senderr(EHOSTUNREACH);
-                               }
-                       }
-               }
-               if (rt->rt_flags & RTF_REJECT)
-                       if (rt->rt_rmx.rmx_expire == 0 ||
-                           time_second < rt->rt_rmx.rmx_expire)
-                               senderr(rt == rt0 ? EHOSTDOWN : EHOSTUNREACH);
-       }
+       rt = rt0;
 
        /*
         * If the queueing discipline needs packet classification,
--- sys/netinet/in_offload.c    2013-02-01 16:15:44.990210968 -0500
+++ sys/netinet/in_offload.c    2013-02-01 14:47:07.000000000 -0500
@@ -37,6 +37,7 @@ __KERNEL_RCSID(0, "$NetBSD: in_offload.c
 #include <netinet/in.h>
 #include <netinet/in_systm.h>
 #include <netinet/ip.h>
+#include <netinet/ip_var.h>
 #include <netinet/tcp.h>
 #include <netinet/in_offload.h>
 
@@ -55,9 +56,7 @@ ip_tso_output_callback(void *vp, struct 
        struct ifnet *ifp = args->ifp;
        int error;
 
-       KERNEL_LOCK(1, NULL);
-       error = (*ifp->if_output)(ifp, m, args->sa, args->rt);
-       KERNEL_UNLOCK_ONE(NULL);
+       error = ip_hresolv_output(ifp, m, args->sa, args->rt);
        return error;
 }
 
--- sys/netinet/ip_output.c     2013-02-01 16:15:44.998211001 -0500
+++ sys/netinet/ip_output.c     2013-02-01 14:53:26.000000000 -0500
@@ -113,6 +113,7 @@ __KERNEL_RCSID(0, "$NetBSD: ip_output.c,
 #include <sys/proc.h>
 
 #include <net/if.h>
+#include <net/if_types.h>
 #include <net/route.h>
 #include <net/pfil.h>
 
@@ -160,6 +161,107 @@ extern struct pfil_head inet_pfil_hook;   
 
 int    ip_do_loopback_cksum = 0;
 
+static bool
+ip_hresolv_needed(const struct ifnet * const ifp)
+{
+       switch (ifp->if_type) {
+       case IFT_ARCNET:
+       case IFT_ETHER:
+       case IFT_FDDI:
+       case IFT_IEEE1394:
+               return 1;
+       default:
+               return 0;
+       }
+}
+
+/* (Optionally) hold the kernel lock over an if_output call.
+ *
+ * Currently this holds the kernel lock over the if_output call
+ * only if the ifp has no if_intern_txenqueue method.
+ */
+static __attribute__((always_inline)) inline int
+klock_if_output(struct ifnet * const ifp, struct mbuf * const m,
+    const struct sockaddr * const dst, struct rtentry *rt)
+{
+       int error;
+
+       if (ifp->if_intern_txenqueue == NULL)
+               KERNEL_LOCK(1, NULL);
+
+       error = (*ifp->if_output)(ifp, m, dst, rt);
+
+       if (ifp->if_intern_txenqueue == NULL)
+               KERNEL_UNLOCK_ONE(NULL);
+
+       return error;
+}
+
+/*
+ * Send an IP packet to a host.
+ *
+ * If necessary, resolve the arbitrary IP route, rt0, to an IP host route 
before
+ * calling ifp's output routine.
+ */
+int
+ip_hresolv_output(struct ifnet * const ifp, struct mbuf * const m,
+    const struct sockaddr * const dst, struct rtentry *rt0)
+{
+#define senderr(e) { error = (e); goto bad;}
+       int error = 0;
+       struct rtentry *rt;
+
+       if (!ip_hresolv_needed(ifp))
+               return klock_if_output(ifp, m, dst, rt0);
+
+       if ((rt = rt0) == NULL)
+               return klock_if_output(ifp, m, dst, NULL);
+
+       /* The following block is highly questionable.  How did we get here
+        * with a !RTF_UP route?  Does rtalloc1() always return an RTF_UP
+        * route?
+        */
+       if ((rt->rt_flags & RTF_UP) == 0) {
+               if ((rt0 = rt = rtalloc1(dst, 1)) == NULL)
+                       senderr(EHOSTUNREACH);
+               rt->rt_refcnt--;
+               if (rt->rt_ifp != ifp)
+                       return ip_hresolv_output(rt->rt_ifp, m, dst, rt);
+       }
+
+       if ((rt->rt_flags & RTF_GATEWAY) == 0)
+               return klock_if_output(ifp, m, dst, rt);
+
+       if ((rt = rt->rt_gwroute) == NULL || (rt->rt_flags & RTF_UP) == 0) {
+               if (rt != NULL) {
+                       rtfree(rt);
+                       rt = rt0;
+               }
+               rt->rt_gwroute = rtalloc1(rt->rt_gateway, 1);
+               if ((rt = rt->rt_gwroute) == NULL)
+                       senderr(EHOSTUNREACH);
+               /* the "G" test below also prevents rt == rt0 */
+               if ((rt->rt_flags & RTF_GATEWAY) != 0 || rt->rt_ifp != ifp) {
+                       rt->rt_refcnt--;
+                       rt0->rt_gwroute = NULL;
+                       senderr(EHOSTUNREACH);
+               }
+       }
+       if ((rt->rt_flags & RTF_REJECT) != 0) {
+               if (rt->rt_rmx.rmx_expire == 0 ||
+                   time_second < rt->rt_rmx.rmx_expire)
+                       senderr(rt == rt0 ? EHOSTDOWN : EHOSTUNREACH);
+       }
+
+       return klock_if_output(ifp, m, dst, rt);
+bad:
+       if (m != NULL)
+               m_freem(m);
+
+       return error;
+#undef senderr
+}
+
 /*
  * IP output.  The packet in mbuf chain m contains a skeletal IP
  * header (with len, off, ttl, proto, tos, src, dst).
@@ -911,13 +1046,11 @@ spd_done:
                if (__predict_true(
                    (m->m_pkthdr.csum_flags & M_CSUM_TSOv4) == 0 ||
                    (ifp->if_capenable & IFCAP_TSOv4) != 0)) {
-                       KERNEL_LOCK(1, NULL);
                        error =
-                           (*ifp->if_output)(ifp, m,
+                           ip_hresolv_output(ifp, m,
                                (m->m_flags & M_MCAST) ?
                                    sintocsa(rdst) : sintocsa(dst),
                                rt);
-                       KERNEL_UNLOCK_ONE(NULL);
                } else {
                        error =
                            ip_tso_output(ifp, m,
@@ -989,12 +1122,10 @@ spd_done:
                        {
                                KASSERT((m->m_pkthdr.csum_flags &
                                    (M_CSUM_UDPv4 | M_CSUM_TCPv4)) == 0);
-                               KERNEL_LOCK(1, NULL);
-                               error = (*ifp->if_output)(ifp, m,
+                               error = ip_hresolv_output(ifp, m,
                                    (m->m_flags & M_MCAST) ?
                                        sintocsa(rdst) : sintocsa(dst),
                                    rt);
-                               KERNEL_UNLOCK_ONE(NULL);
                        }
                } else
                        m_freem(m);
--- sys/netinet/ip_var.h        2013-02-01 16:15:45.009211048 -0500
+++ sys/netinet/ip_var.h        2013-02-01 14:29:14.000000000 -0500
@@ -222,6 +222,10 @@ void        rip_input(struct mbuf *, ...);
 int     rip_output(struct mbuf *, ...);
 int     rip_usrreq(struct socket *,
            int, struct mbuf *, struct mbuf *, struct mbuf *, struct lwp *);
+
+int    ip_hresolv_output(struct ifnet * const, struct mbuf * const,
+           const struct sockaddr * const, struct rtentry *);
+
 int    ipflow_init(int);
 void   ipflow_poolinit(void);
 void   ipflow_prune(void);


Home | Main Index | Thread Index | Old Index