Subject: Re: fixing send(2) semantics (kern/29750)
To: None <tech-kern@netbsd.org>
From: Christos Zoulas <christos@tac.gw.com>
List: tech-kern
Date: 03/27/2005 00:16:42
In article <200503270447.XAA22242@Sparkle.Rodents.Montreal.QC.CA>,
der Mouse  <mouse@Rodents.Montreal.QC.CA> wrote:
>>> [...], one need say no more than repeat the observation that, under
>>> heavy network load as evidenced by full queues, one is better off to
>>> drop packets at their source than to spend resources sending them
>>> into the network, only to have them dropped later.
>> This is a different scenario.  The CPU is a lot faster than the NIC,
>> and the NIC cannot absorb packets quickly enough to send them out to
>> the network.  It is not congestion in the network fabric, but
>> internal congestion.
>
>This is not theoretically different from an infinitely fast NIC (or
>"faster than any other device" if you don't like infinities) to a
>switch that steps down to the actual wire speed.  In particular, the
>theoretical congestion results that tell you what you should do as a
>router when your buffers fill up apply equally well here.

In practice it is different, because the network itself is not congested.
The best the computer can do is wait until there is space in the output
queue. If send fails the way it does now, it actually causes more work
and slower transmission, because packets end up being dropped.

>> There is also currently no way to rate limit send so that it does not
>> return ENOBUFS from the application side, [...].  I.e. I cannot even
>> select or poll before I send, in order to avoid getting ENOBUFS.
>
>Yes.
>
>> and this is clearly broken.
>
>No.  At least, it's not clear to me.
>
>You can't *guarantee* to avoid ENOBUFS, even if we made SOCK_DGRAM
>sockets poll()able for write, since between the time poll shows
>writable and the time you call send other processes could fill up the
>interface's output queue - and likely will, if you have multiple
>processes using this technique to wait for space on the same
>interface's output queue.

A guarantee is one thing; at least being able to tell the application
that it is OK to try again is another. As things stand, select/poll always
succeeds and returns immediately, so the application either spins or
has to sleep with an artificial timeout that is unrelated to the
congestion level.
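
To make the workaround concrete, here is a rough sketch of what an
application ends up doing today. The names send_with_backoff and
next_backoff_ns and the 1 ms / 128 ms bounds are made up for illustration;
nothing here is taken from an existing program. The poll() call is kept
only to show that it does not help: on a datagram socket it reports
writable regardless of the interface output queue.

```c
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <time.h>

/* Arbitrary illustrative bounds; unrelated to actual queue occupancy. */
#define BACKOFF_MIN_NS   1000000L	/* 1 ms */
#define BACKOFF_MAX_NS 128000000L	/* 128 ms */

/* Double the previous delay, clamped to [BACKOFF_MIN_NS, BACKOFF_MAX_NS]. */
static long
next_backoff_ns(long ns)
{
	if (ns < BACKOFF_MIN_NS)
		return BACKOFF_MIN_NS;
	if (ns >= BACKOFF_MAX_NS / 2)
		return BACKOFF_MAX_NS;
	return ns * 2;
}

/*
 * Try to send one datagram, sleeping with a guessed timeout whenever
 * send(2) fails with ENOBUFS.  Gives up after max_tries attempts.
 */
static ssize_t
send_with_backoff(int fd, const void *buf, size_t len, int max_tries)
{
	long delay = 0;

	for (int i = 0; i < max_tries; i++) {
		struct pollfd pfd = { .fd = fd, .events = POLLOUT };

		/* Says nothing about the interface queue, so the result is ignored. */
		(void)poll(&pfd, 1, -1);

		ssize_t n = send(fd, buf, len, 0);
		if (n >= 0 || errno != ENOBUFS)
			return n;	/* success, or an error we do not retry */

		/* The artificial timeout: pure guesswork about when to retry. */
		delay = next_backoff_ns(delay);
		struct timespec ts = { .tv_sec = 0, .tv_nsec = delay };
		nanosleep(&ts, NULL);
	}
	errno = ENOBUFS;
	return -1;
}
```

If poll() reflected space in the output queue, the sleep could go away
and the loop would wake up when a retry is actually worth making.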

christos