Subject: Socket buffer accounting and TCP
To: None <tech-kern@netbsd.org>
From: Charles M. Hannum <mycroft@mit.edu>
List: tech-kern
Date: 09/02/1998 07:07:01
Socket buffer accounting is done in two ways:

* by bytes used
* by actual mbuf space allocated

The idea is to limit the amount of space used even if we are dealing
with a bunch of tiny packets.
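
Concretely, in the 4.4BSD-derived code both limits live side by side
in struct sockbuf, and sbspace() takes whichever is smaller (roughly,
from sys/socketvar.h; sbreserve() sets sb_mbmax to about twice
sb_hiwat, to leave headroom for mbuf overhead):

    struct sockbuf {
            u_long  sb_cc;          /* bytes of data queued */
            u_long  sb_hiwat;       /* max bytes of data allowed */
            u_long  sb_mbcnt;       /* mbuf storage actually allocated */
            u_long  sb_mbmax;       /* max mbuf storage allowed */
            /* ... */
    };

    /* Free space is limited by whichever counter is nearer its cap. */
    #define sbspace(sb) \
            ((long) imin((int)((sb)->sb_hiwat - (sb)->sb_cc), \
                (int)((sb)->sb_mbmax - (sb)->sb_mbcnt)))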

On the transmit side, the TCP stack sends a window size corresponding
to the amount of space available, using sbspace(), which DTRT (it
respects both limits).  On the receive side, it uses sbcompress() to
compact mbufs -- but this applies only to small mbufs, and not
clusters.  It does not check that there is space available before
passing data to sbappend(); it assumes that once it's advertised the
right edge of the window, it will not move to the left (so to speak).
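
For reference, the advertisement logic in tcp_output() looks roughly
like this (condensed); the last test is the `right edge never moves
left' guarantee:

    long win = sbspace(&so->so_rcv);

    /* Receiver-side silly window avoidance: don't bother
     * advertising a tiny window. */
    if (win < (long)(so->so_rcv.sb_hiwat / 4) &&
        win < (long)tp->t_maxseg)
            win = 0;
    /* Never advertise less than we already have outstanding; the
     * right edge of the window must not move left. */
    if (win < (long)(tp->rcv_adv - tp->rcv_nxt))
            win = (long)(tp->rcv_adv - tp->rcv_nxt);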

Now consider a pathological case.  If I advertise a 64k window, and
the sending side proceeds to send me 1092 60-byte packets (which will
all have been put in clusters when we receive them, because they are
too large for an mbuf once the protocol headers are added) before
getting another window update, we will happily accept the data, and my
receive buffer will now consume >1MB of space -- with 2k clusters,
1092 of them is ~2.2MB of storage for only ~64k of payload.
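
To see why such small packets land in clusters at all: a header mbuf
has only MHLEN (roughly 100) bytes of internal storage, so a typical
driver receive path does something like this (a sketch using the
standard mbuf macros; len is the received frame length):

    struct mbuf *m;

    MGETHDR(m, M_DONTWAIT, MT_DATA);
    if (m == NULL)
            return;
    if (len > MHLEN) {
            /* 60 bytes of data plus TCP/IP and link headers is
             * ~114 bytes, which exceeds MHLEN -- so the whole
             * frame lands in a 2k cluster, wasting ~95% of it. */
            MCLGET(m, M_DONTWAIT);
            if ((m->m_flags & M_EXT) == 0) {
                    m_freem(m);
                    return;
            }
    }
    m->m_pkthdr.len = m->m_len = len;
    /* copy the frame in and hand it to ether_input() ... */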

The sending TCP attempts to prevent an excessive number of small
packets by using the Nagle (sender-side `silly window avoidance')
algorithm.  However, a number of applications disable this by setting
TCP_NODELAY.
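
For reference, this is the userland knob in question:

    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <stdio.h>

    /* Disable the Nagle algorithm on socket s: every write is sent
     * as soon as possible, however small the segment. */
    static int
    set_nodelay(int s)
    {
            int on = 1;

            if (setsockopt(s, IPPROTO_TCP, TCP_NODELAY,
                &on, sizeof(on)) < 0) {
                    perror("setsockopt(TCP_NODELAY)");
                    return (-1);
            }
            return (0);
    }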


Which leads me to a (slightly amusing) problem that I've been
experiencing.  If I suspend a remote CVS client process running over
SSH during the comparison phase, it hits exactly this condition, and
consumes all available mbuf space, preventing the machine from doing
any further network traffic.

This is pretty lame and annoying.

It seems to me that there are several issues here:

1) The input side of the TCP stack needs to actually check that there
   is space available before calling sbappend() (see the first sketch
   after this list).  This will prevent the socket from consuming
   more than its share of space, but may hurt performance in this
   case: only a few packets will fit in the socket buffer, the window
   will shrink rapidly, and a bunch of packets already on the network
   will end up being thrown away.

2) It may be worth changing sbcompress() to compact clusters as well
   (second sketch below).  This can be tweaked to check against some
   threshold; e.g. allow up to 1/2 of the space to be wasted to avoid
   excessive copying.

3) It's unclear to me that SSH should be setting TCP_NODELAY at all
   for non-interactive use.  (This may actually be fixed in later
   versions.)
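
For (1), the check might look something like this in the tcp_input()
receive path (a sketch; names as in tcp_input(), and the statistics
counter is only illustrative):

    /* Before queueing the segment, verify that the receive buffer
     * can really hold it; if not, drop it and let the sender
     * retransmit once the window reopens. */
    if (sbspace(&so->so_rcv) < (long)ti->ti_len) {
            tcpstat.tcps_rcvpackafterwin++;     /* or a new counter */
            m_freem(m);
            goto dropafterack;
    }
    sbappend(&so->so_rcv, m);
    sorwakeup(so);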
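
For (2), the natural spot is the copy test inside sbcompress(), which
currently refuses to copy into anything with M_EXT set.  One possible
shape, with the 1/2-wasted threshold from above (a sketch only; a
real version must be careful about shared clusters):

    if (n && (n->m_flags & M_EOR) == 0 &&
        n->m_type == m->m_type &&
        /* only copy when m wastes over half of its storage */
        ((m->m_flags & M_EXT) == 0 || m->m_len < MCLBYTES / 2) &&
        /* the tail may now be a cluster, provided it isn't shared */
        ((n->m_flags & M_EXT) == 0 ||
         mclrefcnt[mtocl(n->m_ext.ext_buf)] == 1) &&
        m->m_len <= M_TRAILINGSPACE(n)) {
            bcopy(mtod(m, caddr_t), mtod(n, caddr_t) + n->m_len,
                (unsigned)m->m_len);
            n->m_len += m->m_len;
            sb->sb_cc += m->m_len;
            m = m_free(m);
            continue;
    }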


There's also an interesting question of how this case interacts with
buffer sharing, e.g. with the fixed-size memory pools used by some
Ethernet cards.  It seems to me that one viable option is to allow
passing a shared buffer into the network stack, but then *always*
copy the data in sbappend() (i.e. after all the headers have been
stripped, doing compaction at the same time).  We could also put
zero-copy hooks here to avoid the extra copy in that case.
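
A minimal sketch of that idea, assuming a hypothetical
sbappend_copy() that deep-copies the chain into tightly packed,
socket-owned mbufs before appending (m_copydata() does the
extraction; a zero-copy hook could bypass the routine entirely):

    int
    sbappend_copy(struct sockbuf *sb, struct mbuf *m0)
    {
            struct mbuf *n;
            int off, resid, chunk;

            resid = 0;
            for (n = m0; n != NULL; n = n->m_next)
                    resid += n->m_len;

            for (off = 0; off < resid; off += chunk) {
                    MGET(n, M_DONTWAIT, MT_DATA);
                    if (n == NULL) {
                            m_freem(m0);
                            return (0);
                    }
                    chunk = MLEN;
                    if (resid - off > MLEN) {
                            /* Use a cluster for large remainders;
                             * fall back to a small mbuf if none. */
                            MCLGET(n, M_DONTWAIT);
                            if (n->m_flags & M_EXT)
                                    chunk = MCLBYTES;
                    }
                    if (chunk > resid - off)
                            chunk = resid - off;
                    m_copydata(m0, off, chunk, mtod(n, caddr_t));
                    n->m_len = chunk;
                    sbappend(sb, n);
            }
            /* The original chain (and its shared storage) can now
             * be released back to the driver. */
            m_freem(m0);
            return (1);
    }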


I'd appreciate any constructive thoughts on this.