Plan for improving IP_PKTINFO socket option handling

To: tech-net%netbsd.org@localhost
Subject: Plan for improving IP_PKTINFO socket option handling
From: Tom Ivar Helbekkmo <tih%hamartun.priv.no@localhost>
Date: Thu, 28 Dec 2017 17:15:27 +0100

I'd like to make some changes to the IPv4 socket option handling.
Specifically, I want to change how the IP_PKTINFO options are handled.
Before I attempt to change any code, I'd like input on the plan.

First, a bit of background.

I've been looking at getting the PowerDNS applications (authoritative
name server, recursive name server, and DNS load balancer/firewall) to
compile cleanly on NetBSD, and while I've been able to do so, it took
some ugly workarounds.  Digging into the standards, the source code,
and the documentation from Solaris, Linux, and our own NetBSD (FreeBSD
doesn't do IP_PKTINFO, having instead created an IP_SENDSRCADDR option
as a partner to the traditional IP_RECVDSTADDR), I find that there are
a number of differences, some for no good reason at all.  In a couple
of cases, our code is just wrong.  Also, our documentation of these
options is unclear, and contains errors.

The reason these things exist at all is to enable the owner of a
wildcard bound socket to find out which interface and address an
incoming connection was actually received by, and, in the case of a
UDP socket, to set the source address of an outgoing packet, typically
so that the sender of a UDP request can recognize the response.  For
ease of use, recvmsg() delivers the extra information as a control
message which may then be supplied unchanged to sendmsg() when sending
the response, setting the source address to the original destination.
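
As a rough sketch of that round trip (not taken from any particular
implementation; echo_once is a made-up name), the following assumes a
UDP socket on which packet-info delivery has already been enabled, and
on which no other kinds of control messages are requested:

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <netinet/in.h>

/*
 * Receive one datagram on fd and send it straight back, handing the
 * received control messages to sendmsg() unchanged.  Returns the
 * number of bytes sent, or -1 on error.
 */
ssize_t
echo_once(int fd)
{
	char data[1500];
	/* The union forces cmsghdr alignment on the control buffer. */
	union {
		struct cmsghdr hdr;
		char buf[256];
	} ctl;
	struct sockaddr_storage peer;
	struct iovec iov;
	struct msghdr msg;
	ssize_t n;

	memset(&msg, 0, sizeof(msg));
	iov.iov_base = data;
	iov.iov_len = sizeof(data);
	msg.msg_name = &peer;
	msg.msg_namelen = sizeof(peer);
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctl.buf;
	msg.msg_controllen = sizeof(ctl.buf);

	n = recvmsg(fd, &msg, 0);
	if (n < 0)
		return -1;
	if (msg.msg_flags & MSG_CTRUNC)
		return -1;	/* control data was cut short; don't reuse it */
	assert(msg.msg_controllen <= sizeof(ctl.buf));

	/*
	 * msg_name, msg_control and msg_controllen now describe the
	 * sender and the received cmsgs; only the payload length changes.
	 */
	iov.iov_len = (size_t)n;
	msg.msg_flags = 0;
	return sendmsg(fd, &msg, 0);
}
```

The point of the design is that the application never has to look
inside the control message at all in order to answer from the right
address.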

The IPv4 implementation of the *PKTINFO options is not standardized.
It has been implemented several times, modeled, with varying degrees
of accuracy, on the IPv6 version, standardized by RFC3542.

Here's a summary of the IPv6 functionality:

Option IPV6_RECVPKTINFO on socket:
   recvmsg() will supply IPV6_PKTINFO cmsgs for incoming packets

Option IPV6_PKTINFO on socket:
   sets the default source address to be used when sending packets

Control message IPV6_PKTINFO from recvmsg():
   contains an in6_pktinfo structure with the specific destination address
   
Control message IPV6_PKTINFO to sendmsg():
   supply an in6_pktinfo structure with the source address to be used

All of these work the same way on BSD, Solaris, and Linux (as per
RFC3542).  The in6_pktinfo structure holds the address (in ipi6_addr),
and the interface index (ipi6_ifindex).

Note how the IPV6_RECVPKTINFO option is used to request IPV6_PKTINFO
control messages with incoming packets, while the IPV6_PKTINFO option
sets a default source address for the socket, and the IPV6_PKTINFO
control message on an outgoing packet sets the source address for that
particular packet.
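
For illustration, the receiving side of this could look roughly like
the sketch below.  The function names are made up; the _GNU_SOURCE
define is only there because glibc hides struct in6_pktinfo without
it, and is assumed to be harmless elsewhere.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE	/* glibc only exposes struct in6_pktinfo with this */
#endif
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>

/* Socket option: ask for IPV6_PKTINFO cmsgs on received packets. */
int
enable_pktinfo6(int fd)
{
	int on = 1;

	return setsockopt(fd, IPPROTO_IPV6, IPV6_RECVPKTINFO,
	    &on, sizeof(on));
}

/*
 * Find the IPV6_PKTINFO control message in a msghdr filled in by
 * recvmsg() and copy its payload to *out.  Returns 0 if one was
 * found, -1 otherwise.
 */
int
find_pktinfo6(struct msghdr *msg, struct in6_pktinfo *out)
{
	struct cmsghdr *cm;

	assert(msg != NULL && out != NULL);
	for (cm = CMSG_FIRSTHDR(msg); cm != NULL;
	    cm = CMSG_NXTHDR(msg, cm)) {
		if (cm->cmsg_level == IPPROTO_IPV6 &&
		    cm->cmsg_type == IPV6_PKTINFO &&
		    cm->cmsg_len >= CMSG_LEN(sizeof(*out))) {
			/* CMSG_DATA() need not be aligned for the struct */
			memcpy(out, CMSG_DATA(cm), sizeof(*out));
			return 0;
		}
	}
	return -1;
}
```

Note that the option name (IPV6_RECVPKTINFO) and the control message
type (IPV6_PKTINFO) differ, which is exactly the split described above.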

Now to the IPv4 implementation.  In Solaris, this was done as a direct
translation of the IPv6 option set:

Option IP_RECVPKTINFO on socket:
   recvmsg() will supply IP_PKTINFO cmsgs for incoming packets

Option IP_PKTINFO on socket:
   sets the default source address to be used when sending packets

Control message IP_PKTINFO from recvmsg():
   contains an in_pktinfo structure with the specific destination address

Control message IP_PKTINFO to sendmsg():
   supply an in_pktinfo structure with the source address to be used

Then Linux almost copied this scheme, but they dropped IP_RECVPKTINFO,
instead using the IP_PKTINFO option to control the delivery of
IP_PKTINFO control messages with incoming packets.  In doing so, they
lost the ability to set a default outgoing source address.  This is
arguably not a great loss, but it does break compatibility with
Solaris, and it gratuitously breaks orthogonality with IPv6.

Next, Solaris and Linux both keep the ipi_ifindex and ipi_addr fields,
but add a third field, ipi_spec_dst.  The name is supposed to refer to
the "specific destination" described in RFCs 1122 and 1123.  They
chose to differentiate between the destination address as supplied in
the incoming IP packet itself, and the local address the packet was,
in fact, delivered to (specifically, ipi_spec_dst is said to be "the
destination address of the routing table entry").  For outgoing
packets, the ipi_spec_dst field of the IP_PKTINFO control message
will be used as the source address.

The only real example I can think of where the two would differ is
when you listen on 0/0, and receive a packet on the loopback
interface, addressed not to 127.0.0.1, but, say, 127.1.2.3.  By the
documentation, this should give an IP_PKTINFO control message with
ipi_addr set to 127.1.2.3, and ipi_spec_dst set to 127.0.0.1.  That's
not how Linux works, though: it will set both to 127.1.2.3.  When
sending a response, if you pass that control message unchanged to
sendmsg(), you'll be sending from 127.1.2.3 (instead of the documented
127.0.0.1, which wouldn't work), and this may be a hint as to why
Linux puts the packet header destination in both fields.  On NetBSD,
sending to 127.1.2.3 doesn't work at all.

(This is a general difference in the handling of the loopback
interface: if you 'ping 127.1.2.3' on Linux, you get responses from
127.1.2.3.  On NetBSD, you get a 'network unreachable' instead.)

Now, on to NetBSD.

We've mostly copied the way things work in Solaris and Linux, but with
a couple of little twists that break source compatibility with both.

First, we don't have the ipi_spec_dst field at all.  Since a lot of
source code out there is written with Solaris and/or Linux in mind,
this breaks compatibility at the source level.  I don't have a Solaris
system handy for testing, but from what I observe on Linux, and how
its loopback handling differs from NetBSD, as described above, we
could just toss in a "#define ipi_spec_dst ipi_addr" and be good.

Next, we do something really silly with the name IP_RECVPKTINFO.
Remember that this is the option to turn on the generation of
IP_PKTINFO control messages for recvmsg(), and that Linux dropped it,
changing the IP_PKTINFO option to do this instead of setting the
default source address for outgoing packets?  Well, we've reinstated
the option, but in NetBSD it enables the generation of IP_RECVPKTINFO
control messages containing the *source* addresses of the incoming
packets.  This is completely meaningless, as we have that information
in the standard message header from recvmsg() already, so it'll never
be used for this purpose.

What it does do, though, is trick source code that supports the
Solaris IP_RECVPKTINFO option into thinking we work the same way.  See
external/bsd/dhcp/dist/common/socket.c for an example of functionality
we're missing.  Note how they test for the presence of both symbols
IP_PKTINFO and IP_RECVPKTINFO, and then assume that the functionality
of Solaris is present.  Other code I've read checks for IP_PKTINFO
first, and then uses IP_RECVPKTINFO to decide whether to do things the
Solaris or the Linux way.  Our use of the latter symbol breaks this.
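
The second pattern tends to look something like the sketch below
(enable_dstaddr_cmsgs is a hypothetical name, and the per-branch
semantics are what the application assumes, not something it can
verify at compile time).  On NetBSD as it stands, the first branch is
chosen, compiles, and then quietly delivers the wrong thing.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>

/*
 * Ask for destination-address control messages on a UDP socket,
 * picking the mechanism by which symbols the headers define.
 * Returns 0 on success, -1 with errno set otherwise.
 */
int
enable_dstaddr_cmsgs(int fd)
{
	int on = 1;

#if defined(IP_PKTINFO) && defined(IP_RECVPKTINFO)
	/* Assumed Solaris semantics: IP_RECVPKTINFO turns delivery on. */
	return setsockopt(fd, IPPROTO_IP, IP_RECVPKTINFO, &on, sizeof(on));
#elif defined(IP_PKTINFO)
	/* Assumed Linux semantics: the IP_PKTINFO option turns it on. */
	return setsockopt(fd, IPPROTO_IP, IP_PKTINFO, &on, sizeof(on));
#elif defined(IP_RECVDSTADDR)
	/* Traditional BSD fallback: address only, no interface index. */
	return setsockopt(fd, IPPROTO_IP, IP_RECVDSTADDR, &on, sizeof(on));
#else
	(void)fd;
	(void)on;
	errno = ENOPROTOOPT;
	return -1;
#endif
}
```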

Finally, here's what I'd like to change:

1) "#define ipi_spec_dst ipi_addr" in <netinet/in.h>

2) Change the IP_RECVPKTINFO option to control the generation of
   IP_PKTINFO control messages, the way it's done in Solaris.

3) Remove the superfluous IP_RECVPKTINFO control message.

4) Change the IP_PKTINFO option to do different things depending on
   the size of the parameter it's supplied with:
   - If the size is sizeof(int), assume it's being used as in Linux:
     - If the value is non-zero, turn on the IP_RECVPKTINFO option.
     - If the value is zero, turn off the IP_RECVPKTINFO option.
   - If the size is sizeof(struct in_pktinfo), assume it's being used
     as in Solaris, to set a default for the source interface and/or
     source address for outgoing packets on the socket.

5) Fix our documentation.  Both ip(4) and ip6(4) contain errors in
   their descriptions of these particular options and control messages.

With this, we should have automatic source code compatibility with
pretty much everything, and orthogonality between IPv6 and IPv4.

-tih
-- 
Most people who graduate with CS degrees don't understand the significance
of Lisp.  Lisp is the most important idea in computer science.  --Alan Kay

