Subject: Re: Melting down your network [Subject changed]
To: None <tech-kern@netbsd.org>
From: Christos Zoulas <christos@tac.gw.com>
List: tech-kern
Date: 03/29/2005 11:07:35
In article <200503291523.j2TFNdcx003265@ector.cs.purdue.edu>,
J Chapman Flack  <flack@cs.purdue.edu> wrote:
>Alan Barrett enumerated options:
>>   a) drop the packet and return 0
>>   b) drop the packet and return ENOBUFS
>>   c) delay the packet until it can be sent, then return 0
>
>uhrm, if Jonathan's position is that only *actually dropping packets* can
>achieve desired congestion control properties, and an *actually dropped
>packet* is one that you had every reason to believe you successfully sent,
>should there be an option
>
>    d) drop the packet and return success, that is, the packet length  ?
>
>In all the other cases, I'm not convinced we're talking about *dropping*
>packets, so much as something more like, umm, *declining* packets, and (as
>I think Christos may have been getting at) it's at least not transparently
>clear to me that the congestion control effects of dropping and declining are
>identical.

Yes, thanks for stating this better. Dropping a single packet at the
output queue level effectively declines that single packet and lets
the application retry.

Imagine if every piece of hardware was designed this way: you need
to busy wait to detect completion of the previous operation instead
of getting an interrupt indicating that the device is ready to
accept the next command. I am sure that the people who wrote device
drivers would not be very happy.

This discussion has been helpful, especially in light of Michael van Elst's
tests of what happens in other OSes:

1. Solaris, Linux: waits
2. AIX: Interface queue unlimited, returns ENOBUFS when mbufs are exhausted
3. IRIX, BSD: Interface queue bound, returns ENOBUFS when queue is full

Since we are talking about send(2) in blocking mode, from an
application perspective [1] is ideal. AIX [2] seems to have taken
the BSD network stack, removed the interface output queue count
check, and decided not to pay attention to the mbuf resource
exhaustion issue.  This does not really fix the problem; it just
pushes the resource issue somewhere down the chain.  IRIX and BSD
[3] just punted, because the way the network stack is designed
makes it difficult to:

	a. sleep and wait for the queue to have space this late in the game.
	b. communicate to the upper layer enough information about the error
	   so that the kernel can retry instead of giving up and propagating
	   the error to userland.
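
To make behaviour [3] concrete, here is a toy userland model of a
bounded output queue. It is only a sketch: the names toy_ifqueue,
toy_enqueue and toy_dequeue are made up for illustration and are not
kernel interfaces; they are loosely modelled on the length, limit and
drop counters an interface queue keeps.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical userland model of a bounded interface output queue. */
struct toy_ifqueue {
	int len;	/* packets currently queued */
	int maxlen;	/* hard limit on queued packets */
	int drops;	/* packets declined because the queue was full */
};

/*
 * Try to queue one packet.  Returns 0 on success, or ENOBUFS when the
 * queue is full; in that case the packet is declined and counted.
 */
static int
toy_enqueue(struct toy_ifqueue *q)
{
	assert(q != NULL);
	if (q->len >= q->maxlen) {
		q->drops++;
		return ENOBUFS;
	}
	q->len++;
	return 0;
}

/* The driver drains one packet, making room for another. */
static void
toy_dequeue(struct toy_ifqueue *q)
{
	assert(q != NULL);
	if (q->len > 0)
		q->len--;
}
```

The caller only learns that the queue was full at that instant. Nothing
in this model tells it when toy_dequeue() has made room again, which is
why the caller is left to retry.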

From my perspective we can:

	a. Leave things as they are. This is not such a big deal; all
	   the application has to do to work is:
		while (send(...) == -1 && errno == ENOBUFS)
			continue;
	   This issue has been there for more than 20 years and we are not
	   the only OS that has punted on fixing it.

	b. Fix the problem by having the kernel wait until there is space
	   in the output queue. This would be nice to have, but not very
	   easy to make work properly in all cases.
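
Option a. can be spelled out a little more fully. The sketch below
assumes a hypothetical helper, send_retry_enobufs(), which is not an
existing API. Note that send(2) reports failure by returning -1 and
setting errno, so the loop tests errno rather than the return value.
The 1 ms pause and the max_tries limit are arbitrary choices for
illustration, there only to avoid spinning forever.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <time.h>

/*
 * Hypothetical helper: call send(2) and retry while it fails with
 * ENOBUFS, sleeping briefly between attempts instead of spinning.
 * Returns what send(2) returned on the final attempt; on failure
 * that is -1 with errno set.
 */
static ssize_t
send_retry_enobufs(int fd, const void *buf, size_t len, int flags,
    int max_tries)
{
	/* 1 ms pause; an arbitrary value chosen for illustration. */
	const struct timespec delay = { 0, 1000000L };
	ssize_t n;
	int tries;

	assert(buf != NULL && max_tries > 0);
	for (tries = 0; tries < max_tries; tries++) {
		n = send(fd, buf, len, flags);
		if (n != -1 || errno != ENOBUFS)
			return n;	/* success, or an error we do not retry */
		(void)nanosleep(&delay, NULL);
	}
	errno = ENOBUFS;	/* still full after max_tries attempts */
	return -1;
}
```

A caller would use it exactly like send(2), for example
send_retry_enobufs(s, buf, len, 0, 100), and still has to handle a
final -1. This is the polling that option b. would move into the
kernel.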

We should not:

	c. Remove the count limit in the output queue. It does not help,
	   and we'll fail the same way AIX does: running out of mbufs.

	d. Drop packets silently by returning 0 instead of ENOBUFS when
	   the output queue becomes full. This just hides the error and
	   pushes the problem to a higher layer where it is more expensive
	   to fix.

christos