tech-net archive

Re: refactoring ip_output() and the L2 _output()

On 1 Feb, 2013, at 15:27 , David Young <> wrote:

> Here are the route acrobatics with some annotations:
>        if ((rt = rt0) != NULL) {
>                if ((rt->rt_flags & RTF_UP) == 0) {
> It's funny that we get into ether_output() with an rtentry that's
> not even usable.  I'm not sure how that happens.  I will have to
> look more carefully at what ip_output() is doing.
>                        if ((rt0 = rt = rtalloc1(dst, 1)) != NULL) {
>                                rt->rt_refcnt--;
>                                if (rt->rt_ifp != ifp)
>                                        return (*rt->rt_ifp->if_output)
>                                                        (ifp, m0, dst, rt);
>                        } else 
>                                senderr(EHOSTUNREACH);
>                }

I'm just catching up on reading this.

To understand this you might look at the code paths in ip_output() dealing
with the SO_DONTROUTE socket option, or maybe at how the strict source
route option is handled (I would do this but I get a feeling of despair
when I look in there..).  If nothing has changed since I last dealt with
this, the traditional way to implement operations which required matching
a non-gatewayed interface route was to avoid the routing table in favor of
scanning the interface list looking for an interface with an address
matching the destination (which sucks something awful on a router with both
many interfaces configured and a lot of applications needing to use
SO_DONTROUTE) and then delivering the packet to that interface.  This
packet won't come with a decent route, since it didn't look at the routing
table to arrive at this interface, but
the ARP entries are in the routing table so someone needs to look in
there anyway.  In addition, if you have two ethernets on the same subnet
the routing table will generally only have routes for one of them (since
the interface routes will be the same it is unable to keep both interfaces'
routes in there at the same time), but the SO_DONTROUTE search may instead
find the interface whose routes are not in the table.  In this case I
think the packet gets sent out the interface the SO_DONTROUTE search
found but uses an ARP entry the other interface learned to do so, which
might explain the code above (though I think it is possible to construct
interface configurations involving an ethernet and a p2p interface with
an overlapping destination address where that code fails).
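
A minimal sketch of the interface-list scan described above, assuming a
hypothetical, simplified sk_ifaddr record and sk_ifscan() helper (these are
illustrative names, not the kernel's actual structures or functions):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified interface address record (not the kernel's ifaddr). */
struct sk_ifaddr {
	uint32_t addr;      /* interface address, host byte order */
	uint32_t mask;      /* subnet mask, host byte order */
	int      ifindex;   /* which interface owns this address */
};

/*
 * Linear scan: return the index of the first interface whose subnet
 * contains dst, or -1 if none does.  The cost grows with the number of
 * configured addresses, and the routing table is never consulted.
 */
static int
sk_ifscan(const struct sk_ifaddr *ifa, size_t n, uint32_t dst)
{
	size_t i;

	assert(ifa != NULL || n == 0);
	for (i = 0; i < n; i++) {
		/* dst is on this subnet if it agrees with addr in all mask bits */
		if (((dst ^ ifa[i].addr) & ifa[i].mask) == 0)
			return ifa[i].ifindex;
	}
	return -1;
}
```

With two ethernets on the same subnet, a scan like this returns whichever
interface happens to be listed first, which need not be the one whose
interface route is in the routing table.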

I actually remember when this code made more sense.  The very earliest
ancestor of this code thought that knowing an output interface and
a next hop IP address was equivalent to knowing the L2 header for
the outgoing packet, because it was.  Most early networks stored
L2 neighbour addresses in the low order bits of the network's IP
addresses; RFC 796 lists these mappings for some large networks, but
doesn't mention that many of the earliest LAN technologies used
1 byte MAC addresses (copied from 3 Mbps Experimental Ethernet) which
were set to be the same as the low order byte of the IP interface
address on the network.  If knowing the outgoing interface and next
hop IP address is sufficient to tell you everything you need to know
to send the packet then it doesn't really matter whether you use
the routing table or the interface address configuration to make a
routing decision because the end product is the same: an output
interface and a next hop IP address.

Note that in those days 10 Mbps Ethernet was actually an outlier,
a wart.  The 6 byte MAC addresses couldn't be fit in the low order
bits of a 4 byte IP address, so the next hop IP address wasn't
directly useful since none of the bits in there told you what the
L2 header should look like.  The idea of maintaining an ARP cache,
burying a fairly complicated protocol in interface code and using
the next hop IP address as an indirection was arrived at because,
while it made Ethernet interfaces bear the additional cost of
resolving this indirection, it also made Ethernet interfaces look
like all the other interfaces where this address was not an indirection
but instead was directly what you needed to know.  Essentially
the ARP cache was a mechanism to make ethernet interfaces look
like ARPAnet interfaces.  Since we've reached a state where virtually
all the networks that worked like the ARPAnet are long gone and just
about everything is, or looks like, an ethernet interface, it is
probably a good idea to try to optimize this for ethernet rather
than for the ARPAnet.
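
To make the contrast concrete, here is a rough sketch of the two ways of
getting from a next hop IP address to an L2 address.  The names
sk_l2_direct(), sk_l2_arp() and struct sk_arpent are hypothetical, and the
linear cache is only for illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Old-style network: the 1-byte L2 address is the low-order byte of the next hop. */
static uint8_t
sk_l2_direct(uint32_t nexthop)
{
	return (uint8_t)(nexthop & 0xff);
}

/* Hypothetical ARP cache entry: IP address to 6-byte MAC address. */
struct sk_arpent {
	uint32_t ip;
	uint8_t  mac[6];
};

/*
 * 10 Mbps Ethernet: the next hop is only a search key.  Returns 0 and
 * fills mac on a hit, -1 on a miss (where a real stack would have to
 * send an ARP request and wait for the answer).
 */
static int
sk_l2_arp(const struct sk_arpent *cache, size_t n, uint32_t nexthop,
    uint8_t mac[6])
{
	size_t i;

	assert(mac != NULL);
	for (i = 0; i < n; i++) {
		if (cache[i].ip == nexthop) {
			memcpy(mac, cache[i].mac, 6);
			return 0;
		}
	}
	return -1;
}
```

Both functions take the same input, an interface's view of a next hop IP
address, which is what let Ethernet look like every other interface to the
code above it.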

So I think that treating ARP like other routing protocols, having
it store the routes it learns in the kernel routing table the way
other routing protocols do and getting directly from the route lookup
to the L2 header without an indirection, was exactly the right thing
to do.  The problem is that it was incompletely implemented.  If you
want to be able to get rid of interface code this way then everything
needs to make its routing decision by looking in the route table, including
SO_DONTROUTE packets and strict source route packets.

This is not difficult to do, but requires dealing with the semantic
shift in what "route lookup" would then mean.  To see this it is probably
worth distinguishing a "route lookup" from a "forwarding lookup".  A
forwarding lookup is done using only address information as the search
key, while a route lookup uses both address information and other data
associated with the route for the search (like "match routes used for
forwarding", or "match only interface routes", or "match only interface
routes for a particular interface").  In a forwarding table it makes no sense
to have more than one route with any particular destination prefix since, with
only address information as the search key, there is nothing to distinguish
routes with the same prefix.  In a routing table there is.  Since the current
kernel radix tree only has search entries taking addresses alone as keys,
and can't store more than one route with the same address and mask, it is
a forwarding table.
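
A small sketch of what "forwarding table" means here, using a flat array
instead of a radix tree.  All names (sk_fwdtable, sk_fwd_add, sk_fwd_lookup)
are hypothetical, and contiguous masks are assumed so that a numerically
larger mask is a longer one:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define SK_MAXROUTES 16

struct sk_fwdent {
	uint32_t prefix;   /* destination prefix, host byte order */
	uint32_t mask;     /* contiguous netmask */
	int      ifindex;  /* output interface */
};

struct sk_fwdtable {
	struct sk_fwdent ent[SK_MAXROUTES];
	size_t n;
};

/* Add a route; a second route with the same prefix and mask is refused. */
static int
sk_fwd_add(struct sk_fwdtable *t, uint32_t prefix, uint32_t mask, int ifindex)
{
	size_t i;

	assert(t != NULL);
	prefix &= mask;
	for (i = 0; i < t->n; i++)
		if (t->ent[i].prefix == prefix && t->ent[i].mask == mask)
			return EEXIST;
	if (t->n == SK_MAXROUTES)
		return ENOSPC;
	t->ent[t->n].prefix = prefix;
	t->ent[t->n].mask = mask;
	t->ent[t->n].ifindex = ifindex;
	t->n++;
	return 0;
}

/* Forwarding lookup: the address is the only key; the longest mask wins. */
static const struct sk_fwdent *
sk_fwd_lookup(const struct sk_fwdtable *t, uint32_t dst)
{
	const struct sk_fwdent *best = NULL;
	size_t i;

	assert(t != NULL);
	for (i = 0; i < t->n; i++) {
		const struct sk_fwdent *e = &t->ent[i];

		if ((dst & e->mask) != e->prefix)
			continue;
		if (best == NULL || e->mask > best->mask)
			best = e;
	}
	return best;
}
```

Because the key is the address alone, there is no way to ask this table
for "the interface route on interface 2", and no room to store it if
interface 1 already has a route for the same prefix.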

To do SO_DONTROUTE and strict source route routing decisions in the route
table, and eliminate that interface code, you need to provide route lookup
semantics ("match only interface routes") and you need to guarantee that
all the routes they would need to find (i.e. that they would arrive at by
scanning the interface address configuration instead) are always in that
table.  The latter is almost always the case anyway, since interface routes
are normally important for forwarding lookups too, but it still requires
bridging the gap between "almost" and "always" and that includes dealing
with cases like two ethernets attached to the same subnet; I'm pretty sure
the data structure would need to be able to store and search more than one
route to the same destination.
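
As a sketch of those route lookup semantics, assume a table that is allowed
to hold several routes for one prefix and a lookup that takes constraints
in addition to the address.  The names sk_route, sk_rt_lookup,
SK_RTF_IFROUTE and SK_ANY_IF are hypothetical, and a flat array stands in
for whatever data structure would really be used:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SK_RTF_IFROUTE 0x1   /* hypothetical flag: non-gatewayed interface route */
#define SK_ANY_IF      (-1)  /* no constraint on the output interface */

struct sk_route {
	uint32_t prefix;     /* already masked, host byte order */
	uint32_t mask;       /* contiguous netmask */
	int      ifindex;
	unsigned flags;
};

/*
 * Route lookup: longest match on dst among the routes that also satisfy
 * the non-address constraints.  Every bit in need_flags must be set on
 * the route, and its interface must equal want_if unless want_if is
 * SK_ANY_IF.  Returns NULL if nothing qualifies.
 */
static const struct sk_route *
sk_rt_lookup(const struct sk_route *rt, size_t n, uint32_t dst,
    unsigned need_flags, int want_if)
{
	const struct sk_route *best = NULL;
	size_t i;

	assert(rt != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if ((dst & rt[i].mask) != rt[i].prefix)
			continue;	/* address does not match */
		if ((rt[i].flags & need_flags) != need_flags)
			continue;	/* wrong kind of route */
		if (want_if != SK_ANY_IF && rt[i].ifindex != want_if)
			continue;	/* wrong interface */
		if (best == NULL || rt[i].mask > best->mask)
			best = &rt[i];
	}
	return best;
}
```

An SO_DONTROUTE send would then be a lookup with need_flags set to
SK_RTF_IFROUTE: it finds the interface route for an on-link destination,
can be pinned to one of two ethernets on the same subnet, and returns NULL
for a destination that is only reachable through a gateway.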

Speaking of which, that radix tree implementation in the kernel is very
old and crufty.  It was a very early implementation of a variable length
mask search (when routes were classed, route lookups were done in a hash
table), the code was always opaque and had minor bugs that seemed impossible
to find, it isn't very fast and operates in a way which unnecessarily
penalizes longer addresses (i.e. IPv6), and we've learned just a whole lot
more about data structures for longest match lookups since then.  If
noncontiguous masks are no longer interesting then maybe it is time to
replace that?

Dennis Ferguson