Re: Improving use of rt_refcnt

To: Ryota Ozaki <ozaki-r%netbsd.org@localhost>
Subject: Re: Improving use of rt_refcnt
From: Dennis Ferguson <dennis.c.ferguson%gmail.com@localhost>
Date: Wed, 8 Jul 2015 21:28:01 -0700

On 7 Jul, 2015, at 21:25 , Ryota Ozaki <ozaki-r%netbsd.org@localhost> wrote:

> BTW how do you think of separating L2 tables (ARP/NDP) from the L3
> routing tables? The separation gets rid of cloning/cloned route
> features and that makes it easy to introduce locks in route.c.
> (Currently rtrequest1 can be called recursively to remove cloned
> routes and that makes it hard to use locks.) I read your paper
> (BSDNetworking.pdf) and it seems to suggest to maintain L2 routes
> in the common routing table (I may misunderstand your opinion).

I think it is worth stepping back and thinking about what the end
result of the most common type of access to the route table (a
forwarding operation, done by a reader who wants to know what to do
with a packet it has) is going to be, since this is the operation you
want to optimize.  If the packet is to be sent out an interface then
the result of the work you are doing is that an L2 header will be
prepended to the packet and the packet will be queued to an interface
for transmission.

To make this direct and fast what you want is for the result of the
route lookup to point directly at the thing that knows what L2 header
needs to be added and which interface the packet needs to be delivered
to.  If you have that then all that remains to be done after the
route lookup is to make space at the front of the packet for the L2
header, memcpy() it in and give the resulting frame to the interface.
So you want the route lookup organized to get you from the addresses
in the packet you are processing to the L2 header and interface you
need to use to send a packet like that as directly as possible.
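
As a rough sketch of what "all that remains" could look like, assuming
the route lookup hands back a structure holding a prebuilt L2 header
and the output interface.  The names here (struct nexthop, struct pkt,
nexthop_encap(), L2_HDR_MAX) are hypothetical and are not existing
NetBSD interfaces; a real kernel would work on mbufs instead of a flat
buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define L2_HDR_MAX 32	/* hypothetical upper bound on L2 header size */

/* Hypothetical result of a route lookup: all that is needed to send. */
struct nexthop {
	int	nh_ifindex;		/* interface to queue the frame to */
	size_t	nh_l2len;		/* length of the prebuilt L2 header */
	uint8_t	nh_l2hdr[L2_HDR_MAX];	/* prebuilt L2 header bytes */
};

/* Simplified packet buffer with headroom in front of the data. */
struct pkt {
	uint8_t	buf[256];
	size_t	off;	/* offset of the first valid byte */
	size_t	len;	/* number of valid bytes */
};

/*
 * Make space at the front of the packet and copy the L2 header in.
 * Returns 0 on success, -1 if there is not enough headroom.
 */
static int
nexthop_encap(struct pkt *p, const struct nexthop *nh)
{
	assert(nh->nh_l2len <= L2_HDR_MAX);
	if (p->off < nh->nh_l2len)
		return -1;
	p->off -= nh->nh_l2len;
	p->len += nh->nh_l2len;
	memcpy(p->buf + p->off, nh->nh_l2hdr, nh->nh_l2len);
	return 0;
}
```

After a successful call the caller would hand the frame to the
interface named by nh_ifindex; nothing in this path looks at a next
hop IP address.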

While we could talk about how the route lookup might be structured
to better get directly to the point (this involves splitting the
rtentry into a "route" part and a "nexthop" part, the latter being
the result of a lookup and having the data needed to deliver the
packet with minimal extra work), this probably isn't relevant to
your question.  What I did want to point out, however, is that
knowledge of the next hop IP address is (generally) entirely
unnecessary to forward a packet.  All forwarding operations want
to know is the L2 header to add to the packet.  Of course ARP or
ND will have used the next hop IP address to determine the L2 header
to attach to the packet, but once this is known all packet forwarding
wants is the result, the L2 header, and doesn't care how that was
arrived at.  What this means is that your proposed use of the next
hop IP address is a gratuitous indirection; you would be taking
something which would be best done as

    <route lookup> -> <L2 header>

and instead turning this into

    <route lookup> -> <next hop IP address> -> <next hop address lookup> -> <L2 header>

This will likely always be significantly more expensive than the direct
alternative.  The indirection is also easy to resolve up front, when a route
is added, so there's no need to do it over and over again for each forwarded
packet, and failing to do it when routes are installed moves yet another
data structure (per-interface) into the forwarding path that will need to
be dealt with if you eventually want to eliminate the locks.  I think
you shouldn't do this, or anything else that requires if_output() to
look at the next hop IP address, since that indirection should go away.
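
The difference between the two arrangements can be sketched as
follows.  This is only an illustration under assumed names (struct
neigh, struct route_indirect, struct route_direct and the functions
below are hypothetical, not the existing rtentry or ARP code), and a
linear search stands in for whatever the per-interface neighbour
lookup really is.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct l2hdr { size_t len; uint8_t bytes[16]; };

/* Hypothetical per-interface neighbour (ARP/ND) entry. */
struct neigh { uint32_t ip; struct l2hdr hdr; };

/* Indirect: the route only remembers the next hop IP address. */
struct route_indirect { uint32_t gw_ip; };

/* Direct: the route points straight at the L2 header to use. */
struct route_direct { const struct l2hdr *hdr; };

static const struct l2hdr *
neigh_lookup(const struct neigh *tbl, size_t n, uint32_t ip)
{
	for (size_t i = 0; i < n; i++)
		if (tbl[i].ip == ip)
			return &tbl[i].hdr;
	return NULL;
}

/* Indirect path: a second lookup is paid for every forwarded packet. */
static const struct l2hdr *
forward_indirect(const struct route_indirect *rt,
    const struct neigh *tbl, size_t n)
{
	assert(rt != NULL);
	return neigh_lookup(tbl, n, rt->gw_ip);
}

/* Direct path: the indirection is resolved once, when the route is added. */
static void
route_direct_bind(struct route_direct *rt,
    const struct neigh *tbl, size_t n, uint32_t gw_ip)
{
	assert(rt != NULL);
	rt->hdr = neigh_lookup(tbl, n, gw_ip);
}

/* Per packet, the direct path is a single pointer dereference. */
static const struct l2hdr *
forward_direct(const struct route_direct *rt)
{
	assert(rt != NULL);
	return rt->hdr;
}
```

Both paths end at the same L2 header; the direct one simply keeps the
neighbour table out of the per-packet path.  The sketch leaves out
what happens when the neighbour entry is unresolved or changes later.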

The neat thing about this is that the internal arrangement that makes
one think that the next hop IP address is an important result of a route
lookup (it is listed as one in the rtentry structure, and if_output()
takes it as an argument) is actually a historical artifact.  I think
this code was written in about 1980.  Then, as now, the point of the
route lookup was to determine the L2 header to prepend to the packet
and the interface to queue it to, but what was different was the networks
that existed then.  Almost all of them did <IP address> -> <L2 header>
mapping by storing the variable bits of the L2 header directly in the
local bits of the IP address; see RFC 796 and RFC 895 for a whole bunch of
examples (the all-zeros-host-part directed broadcast address that 4.2BSD
used came from the mapping for experimental Ethernet).  This meant that
the next hop IP address wasn't an indirection at all; it was directly
the data you needed to construct the L2 header to add to the packet.
The original exception to this was DIX Ethernet, with its 48-bit MAC
addresses that were too big to store that way.  So the idea of
implementing an ARP cache in the interface code, and using the next hop
IP address as a less efficient indirection to the L2 header data for
that type of interface, was invented to make DIX Ethernet look like a
"normal" interface where the next hop IP address directly and efficiently
provided the L2 bits you needed to know to send the packet.
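
To illustrate why the next hop IP address was "directly the data" on
those older networks, here is a sketch for a made-up link type with
8-bit link addresses carried in the low 8 bits of the IP address.  The
two-byte header layout and the name old_l2hdr_build() are assumptions
for illustration only, not the exact encoding of any network described
in the RFCs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical old-style link: the header is just <dst, src>, each an
 * 8-bit link address.  The destination byte is taken straight from
 * the low 8 bits of the next hop IP address (host byte order), so no
 * table lookup is needed.
 */
static void
old_l2hdr_build(uint32_t nexthop_ip, uint8_t self_addr, uint8_t hdr[2])
{
	assert(hdr != NULL);
	hdr[0] = (uint8_t)(nexthop_ip & 0xffu);	/* destination */
	hdr[1] = self_addr;			/* source */
}
```

With 48-bit MAC addresses no such arithmetic mapping from a 32-bit
IP address exists, which is where the ARP cache comes in.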

The thing is that pretty much all the networks that were "normal"
in 1980 had disappeared by about 1990, leaving only networks that
worked like DIX Ethernet.  You would think the code would have been
restructured for the new "normal" since then, but I guess old code
dies hard.

Dennis Ferguson
