tech-net archive


Re: New class of receive error



On 13/05/2018 10:16, Michael van Elst wrote:

Recently the recv() system call (and variants) were changed to
return the ENOBUFS error for datagram sockets to indicate that
the sender failed to send a packet.

Apparently this only works for local communication like UNIX domain
sockets or the routing socket, as you don't know about sender errors
of a remote system if e.g. UDP is used.

No, this isn't entirely the case.
If any mechanism tries to deliver data to a socket, regardless of its source (UDP, a local domain socket, or anything else), and there isn't enough buffer space, the loss will now be reported.

I believe that once data is in the system it shouldn't be silently discarded. Comments that have existed in our code as far back as I've checked seem to agree, with things like this:
/* XXX notify userland of overflow */

The intention was to make data sent to the routing socket more
reliable by informing the receiver that it lost information and
that it has to resynchronize. So far the only user that needs
that kind of information reliably is dhcpcd.

It's true that dhcpcd is currently the only program that does something useful with the error: it reloads its state using getifaddrs and ioctl.

I would imagine that other applications in the base system should care about this as well. Here's a list from a trivial search and a basic examination:
	rarpd, racoon, ifwatchd, route (monitor), rtadvd

A side effect is that programs using UNIX domain datagram sockets
such as syslogd now fail when they can't keep up with messages
(but only for local messages).

I, for one, would be interested in knowing that the logging mechanism is failing. We don't know the importance of the message that would otherwise have been silently discarded. But at least we now know, so we can do something about it.
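To see the syslogd case locally, here is a small sketch (an illustration, not syslogd's code) that floods one end of a socketpair, standing in for the log socket, until the kernel refuses a datagram. fill_receiver is a hypothetical helper. What the sender sees at that point is system-specific: Linux returns EAGAIN for a non-blocking send, while BSD-derived kernels have typically returned ENOBUFS. With the change discussed here, NetBSD is meant to report such a drop to the receiver as well.

```c
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper: push fixed-size datagrams into tx without
 * blocking until the kernel refuses one. Returns the number queued
 * and stores the errno of the refused send in *err (0 if the safety
 * cap was reached first). */
static long fill_receiver(int tx, int *err)
{
    char msg[128];
    long sent;

    memset(msg, 'x', sizeof msg);
    *err = 0;
    for (sent = 0; sent < 1000000; sent++) {
        if (send(tx, msg, sizeof msg, MSG_DONTWAIT) == -1) {
            /* Receiver is full: EAGAIN on Linux, typically ENOBUFS on BSDs. */
            *err = errno;
            break;
        }
    }
    return sent;
}
```

Called on sv[0] after socketpair(AF_UNIX, SOCK_DGRAM, 0, sv), with sv[1] never read, it returns once the receiving side is full.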

I don't think that's the right approach. In particular, it shouldn't
be pushed to netbsd-8 close to the release.

If you want reliable transport of messages, then you should use
a reliable transport protocol.

If you want complete routing information use synchronous queries
and use routing socket messages only as an optimization.

I believe that's what I've done?
Open a PF_ROUTE socket and listen to it.
In the normal path everything works, and kqueue (or whatever select mechanism is in play) tells me when there is either an error or something to read.

In the case of ENOBUFS, dhcpcd makes no assumptions about what was lost and re-syncs its state using getifaddrs and ioctl. This is an expensive operation, which monitoring the route socket normally avoids.
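That pattern can be sketched as follows, assuming poll() in place of kqueue and a generic datagram descriptor in place of the PF_ROUTE socket (routing message parsing is omitted). The names classify_recv, resync_interfaces and handle_one are hypothetical and not taken from dhcpcd. The point is the split: ordinary messages are applied incrementally, while ENOBUFS triggers the expensive full reload from getifaddrs.

```c
#include <errno.h>
#include <ifaddrs.h>
#include <poll.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* What the caller should do after one recv() on a monitoring socket. */
enum recv_action { RA_PROCESS, RA_RESYNC, RA_RETRY, RA_FATAL };

/* Map a recv() result to an action. ENOBUFS means messages were lost,
 * so incrementally maintained state can no longer be trusted. */
static enum recv_action classify_recv(ssize_t n, int err)
{
    if (n >= 0)
        return RA_PROCESS;
    switch (err) {
    case ENOBUFS:
        return RA_RESYNC;
    case EINTR:
    case EAGAIN:
        return RA_RETRY;
    default:
        return RA_FATAL;
    }
}

/* Full resynchronisation from a fresh getifaddrs() snapshot. Here it
 * only counts entries; a real daemon would rebuild its address table
 * (and query more via ioctl). Returns the count, or -1 on failure. */
static int resync_interfaces(void)
{
    struct ifaddrs *ifap, *ifa;
    int count = 0;

    if (getifaddrs(&ifap) == -1)
        return -1;
    for (ifa = ifap; ifa != NULL; ifa = ifa->ifa_next)
        count++;
    freeifaddrs(ifap);
    return count;
}

/* One pass of the event loop: wait for fd, read one message, react.
 * Returns 1 = message processed, 2 = resynced, 0 = nothing done,
 * -1 = error. */
static int handle_one(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    char buf[2048];
    ssize_t n;

    if (poll(&pfd, 1, timeout_ms) <= 0)
        return 0;               /* timeout or poll error: nothing done */
    n = recv(fd, buf, sizeof buf, 0);
    switch (classify_recv(n, n == -1 ? errno : 0)) {
    case RA_PROCESS:
        return 1;               /* apply the message incrementally (omitted) */
    case RA_RESYNC:
        return resync_interfaces() == -1 ? -1 : 2;
    case RA_RETRY:
        return 0;
    default:
        return -1;
    }
}
```

Calling handle_one in a loop gives the normal monitoring path; the getifaddrs walk only runs when the kernel says something was dropped.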

The origin of this change seems to be the special case of Linux
NETLINK sockets. NETLINK sockets serve similar purposes as the
BSD routing socket for passing kernel information to userland
daemons. Reading from them may yield ENOBUFS errors to inform
userland that messages got lost.  This behaviour is also controlled
by the NETLINK_NO_ENOBUFS socket flag, so it can be turned off
again.
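For comparison, this is roughly how a Linux program opts out of the NETLINK behaviour described above. It is a Linux-only sketch; open_route_monitor is a hypothetical helper, and the fallback SOL_NETLINK definition only matters if the C library headers do not provide it.

```c
#include <linux/netlink.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef SOL_NETLINK
#define SOL_NETLINK 270         /* value from the Linux UAPI headers */
#endif

/* Hypothetical helper: open an rtnetlink socket and, if asked, tell the
 * kernel not to report ENOBUFS on it. Returns the fd or -1 on error. */
static int open_route_monitor(int suppress_enobufs)
{
    int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);

    if (fd == -1)
        return -1;
    if (suppress_enobufs) {
        int on = 1;
        if (setsockopt(fd, SOL_NETLINK, NETLINK_NO_ENOBUFS,
                       &on, sizeof on) == -1) {
            close(fd);
            return -1;
        }
    }
    return fd;
}
```

A program that sets the flag gives up the loss notification, which is exactly the trade-off being argued about here.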

I see no reason to single out Linux here, or any specific type of socket.
From consulting other OS documentation, it seems that only the BSD family does not raise ENOBUFS from a recv call. It's certainly documented by POSIX as a possible error:
http://pubs.opengroup.org/onlinepubs/000095399/functions/recv.html

So any portable application can rightly expect to see ENOBUFS calling recv.

Roy

