Re: refactoring ip_output() and the L2 _output()

To: tech-net%netbsd.org@localhost
Subject: Re: refactoring ip_output() and the L2 _output()
From: Mihai Chelaru <mihai.chelaru%ngnetworks.ro@localhost>
Date: Sat, 2 Feb 2013 16:07:29 +0200

Hi Dave,

Sounds like a good idea !

That return (*rt->rt_ifp->if_output)(ifp, m0, dst, rt) line looksbroken, but I don't think it ever happened - the code path and the biglock took good care of us. Anyhow adding a debug around that and see ifit's really happening would be interesting as someone else suggested.


Some observations:

Please move ip_hresolv_needed() as a bool property (calledresolv_needed or such) in struct ifnet - it will be used by otherprotocols as well. Moreover ip_hresolve_output() doesn't look to be IPdependent, so throwing it inside ip_output.c doesn't look good. Theklock_if_output name looks unfortunate.

I still wonder what you'll do about arpresolve - personaly, I'd like tosee an independent l2 resolving layer attached to struct ifnet and adumb if_output, so the frame will be opaque from if_output perspective.Meaning that I'd like to see if_output called with 4 parameters: ifp,mbuf, a sockaddr represeting the next hop l2 address (arp address forether), and frame_type (ETHERTYPE* for ether). The last parametershould be aquired by upper layers from a function associated withstruct ifnet - int getframetype(int address_family) while the l2addressshould be aquired by upper protocol (IP) using the mentioned resolvinglayer (ARP for IP/ETH).


--
Mihai

On 2013-02-02 01:27:07 +0200, David Young said:

I've been trying to get a handle on a few of the many MP-safety problems
in the network stack.  My investigations brought me yesterday to
ether_output().

ether_output() is surprisingly complicated.  The ether_output() that one
would hope for takes an ifnet, a packet, and a nexthop as arguments,
adds the correct L2 header to the front of the packet, and calls into a
driver's transmit routine.  The ether_output() that we have does some
acrobatics with routes, ARP resolution, cons'ing up and prepending an L2
header, queueing the packet and nudging the driver.

Here are the route acrobatics with some annotations:

        if ((rt = rt0) != NULL) {
                if ((rt->rt_flags & RTF_UP) == 0) {

It's funny that we get into ether_output() with an rtentry that's
not even usable.  I'm not sure how that happens.  I will have to
look more carefully at what ip_output() is doing.

                        if ((rt0 = rt = rtalloc1(dst, 1)) != NULL) {
                                rt->rt_refcnt--;
                                if (rt->rt_ifp != ifp)
                                        return (*rt->rt_ifp->if_output)
                                                        (ifp, m0, dst, rt);

} elsesenderr(EHOSTUNREACH);

                }

Yikes!  To call rt->rt_ifp's output routine with a different ifp is not
advisable, is it?  ISTR finding a bug in carp(4) involving a similar
pattern, ifpX->if_output(ifqY, ...).

It feels wrong that a packet should ever make a hairpin turn in an
L2 output routine, but there it is: first it looks like it's going
out ifp, then it goes out rt->rt_ifp.

The code that follows checks for a gateway route.  Finding one, it looks
for a host route to the gateway.  It seems to me that ether_output() is
a peculiar place to do that: ip_output() should have taken care of that
for us.  Also, it seems to me that the (potentially) two-step lookup,

        1) lookup(destination) -> gateway_rt,
        2) lookup(gateway_rt->destination)

ought to be a one-step lookup, lookup(destination) -> nexthop.

                if ((rt->rt_flags & RTF_GATEWAY) && dst->sa_family != AF_NS) {

I wonder what the Xerox NS (AF_NS) exemption is about? Didn't NetBSD
have a sys/netns/ that was deleted?  Surely the check isn't needed
any more?

                        if (rt->rt_gwroute == NULL)
                                goto lookup;
                        if (((rt = rt->rt_gwroute)->rt_flags & RTF_UP) == 0) {
                                rtfree(rt); rt = rt0;
                        lookup: rt->rt_gwroute = rtalloc1(rt->rt_gateway, 1);
                                if ((rt = rt->rt_gwroute) == NULL)
                                        senderr(EHOSTUNREACH);

/* the "G" test below also prevents rt== rt0 */

                                if ((rt->rt_flags & RTF_GATEWAY) ||
                                    (rt->rt_ifp != ifp)) {
                                        rt->rt_refcnt--;
                                        rt0->rt_gwroute = NULL;
                                        senderr(EHOSTUNREACH);
                                }
                        }
                }
                if (rt->rt_flags & RTF_REJECT)
                        if (rt->rt_rmx.rmx_expire == 0 ||
                            (u_long) time_second < rt->rt_rmx.rmx_expire)
                                senderr(rt == rt0 ? EHOSTDOWN : EHOSTUNREACH);
        }

The route acrobatics alone are replicated in nd6_output(). nd6_output()
will call if_output == ether_output, so similar acrobatics are
performed twice.  Of course, the latter performance is cut short.

Here is the ARP resolution:

        case AF_INET:
                if (m->m_flags & M_BCAST)
                        (void)memcpy(edst, etherbroadcastaddr, sizeof(edst));
                else if (m->m_flags & M_MCAST)
                        ETHER_MAP_IP_MULTICAST(&satocsin(dst)->sin_addr, edst);
                else if (!arpresolve(ifp, rt, m, dst, edst))
                        return (0);     /* if not yet resolved */
                /* If broadcasting on a simplex interface, loopback a copy */
                if ((m->m_flags & M_BCAST) && (ifp->if_flags & IFF_SIMPLEX))
                        mcopy = m_copy(m, 0, (int)M_COPYALL);
                etype = htons(ETHERTYPE_IP);
                break;

BTW, it makes me grimace to see the simplex-loopback handled in an L2
transmit routine.  Here's the rest of that,

        if (mcopy)
                (void)looutput(ifp, mcopy, dst, rt);

Shouldn't that be looutput(lo0ifp, ...) or lo0ifp->if_output(lo0ifp,
...) ? It seems that the network layer should handle loopback itself by
copying the packet and ip_input()'ing it directly.

Both the route acrobatics and the ARP resolution are replicated in
arc_output(), ieee1394_output(), and fddi_output().

As a first attempt at cleaning this up, I've extracted the route
acrobatics from ether_output and moved them to a routine called
ip_hresolv_output that wraps the ifp->if_output call in ip_output.  See
attached patch.  I guess that this if any protocols other than IPv4 and
IPv6 rely on the acrobatics, this change will break them.

I'd like ether_output never, ever to deal with the routing table,
but arpresolve() deals some in routes, so eventually I will have
to factor that out.  I'm aiming for a streamlined IP transmission
path.  Today we have something comparatively baroque,

        # enter ip_output
        dst <- destination(packet)
        nexthop <- lookup(ip routing table, dst)
        # enter ether_output
        if (nexthop is gateway)
                if (gateway is cached)
                        nexthop <- gwcache(nexthop)
                else
                        gwcache(nexthop) <- lookup(ip routing table,
                                                   gwaddr(nexthop))
                        nexthop <- gwcache(nexthop)
        if (nexthop l2 info incomplete)
                defer_until_arp_complete(nexthop, packet)
        else
                l2hdr <- create_l2header(interface(nexthop), gwaddr(nexthop))
                transmit(interface(nexthop), l2hdr | packet)

I'm aiming for something more like this:

        # enter ip_output
        dst <- destination(packet)
        nexthop <- lookup(ip routing table, dst)
        if (nexthop l2 info incomplete)
                defer_until_arp_complete(nexthop, packet)
        else
                # enter ether_output
                transmit(interface(nexthop), l2header(nexthop), packet)

It will take several steps to get there.  Here is one.  A step in the
wrong direction?  Your feedback, please. :-)

Dave

References:
- refactoring ip_output() and the L2 _output()
  - From: David Young

Prev by Date: Re: refactoring ip_output() and the L2 _output()
Next by Date: Re: Temporary IPv6 addresses vs. netgroups
Previous by Thread: Re: refactoring ip_output() and the L2 _output()
Next by Thread: Re: refactoring ip_output() and the L2 _output()
Indexes:

Home | Main Index | Thread Index | Old Index