Subject: patch! Re: sbappend() is not scalable
To: None <tech-net@netbsd.org>
From: Alfred Perlstein <bright@wintelcom.net>
List: tech-net
Date: 10/11/1999 18:04:55
On Fri, 8 Oct 1999, Mohit Aron wrote:

> Hi,
> 	I recently did some experiments with TCP over a high b/w-delay path
> and found a scalability problem in sbappend(). The experimental setup
> consisted of a 100Mbps network with a round-trip delay of 100ms. Under this
> situation, FreeBSD's TCP version is incapable of attaining more than 65 Mbps
> on a 300MHz Pentium II - even without slow-start.
> 
> I tracked down the problem to sbappend() - the routine that appends user data
> into the socket buffers for network transmission. Every time a TCP ACK 
> acknowledges some data, space is created in the socket buffer that permits
> more data to be appended. Unfortunately, the implementation does not maintain
> a pointer to the end of the list of mbufs in the socket buffer. Thus each 
> time any data is added, the whole list of mbufs is traversed to reach the 
> very end where the data is added. Since the b/w-delay product is large, there
> can be about 600 mbufs in the socket buffer waiting to be acknowledged. Thus
> upon every ACK, about 600 mbufs are traversed causing the TCP sender to run 
> out of CPU.
> 
> The problem is not limited only to high b/w networks - it is also present in
> long-latency paths (satellite links). Thus a server transferring a large file
> over a satellite link can spend a lot of CPU time due to the above problem.
> 
> Hope the problem shall be fixed in future releases,
> 
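To make the cost described above concrete, here is a minimal sketch in C,
not the actual FreeBSD code and not the patch linked below. The names
struct buf, struct sbuf, sb_append(), drop_head() and simulate() are all
hypothetical stand-ins for the kernel's mbuf and sockbuf handling. It
counts how many links are followed per append with and without a cached
tail pointer.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-ins for the kernel's mbuf and sockbuf. */
struct buf {
	struct buf *next;
};

struct sbuf {
	struct buf *head;
	struct buf *tail;	/* assumed extra field: last buffer in the chain */
	unsigned long steps;	/* links followed while searching for the end */
};

/* Append b; walk from the head unless use_tail is set. */
static void
sb_append(struct sbuf *sb, struct buf *b, int use_tail)
{
	struct buf *last = sb->tail;

	if (!use_tail) {
		/* O(n): the traversal described in the report above. */
		last = sb->head;
		while (last != NULL && last->next != NULL) {
			last = last->next;
			sb->steps++;
		}
	}
	b->next = NULL;
	if (last == NULL)
		sb->head = b;
	else
		last->next = b;
	sb->tail = b;
}

/* Remove the first buffer, as when an ACK frees acknowledged data. */
static struct buf *
drop_head(struct sbuf *sb)
{
	struct buf *b = sb->head;

	if (b != NULL) {
		sb->head = b->next;
		if (sb->head == NULL)
			sb->tail = NULL;	/* chain is empty again */
	}
	return b;
}

/*
 * Queue `queued` buffers, then for each of `acks` ACKs drop one buffer
 * and append one.  Returns the links followed during the ACK phase only.
 */
static unsigned long
simulate(int use_tail, int queued, int acks)
{
	struct sbuf sb = { NULL, NULL, 0 };
	struct buf *pool;
	unsigned long steps;
	int i;

	if (queued <= 0)
		return 0;
	pool = calloc((size_t)queued, sizeof(*pool));
	if (pool == NULL)
		return 0;
	for (i = 0; i < queued; i++)
		sb_append(&sb, &pool[i], use_tail);
	sb.steps = 0;
	for (i = 0; i < acks; i++)
		sb_append(&sb, drop_head(&sb), use_tail);
	steps = sb.steps;
	free(pool);
	return steps;
}
```

With 600 buffers queued, simulate(0, 600, 1) follows 598 links for a single
ACK, while simulate(1, 600, 1) follows none. The price of the tail pointer
is that every routine that removes or rearranges buffers has to keep it
consistent, as drop_head() does here when the chain becomes empty.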

I'm not sure how well these patches will apply under NetBSD, but I've
got something in the works for FreeBSD. It seems to work pretty well,
but I'd like a larger audience to test it out.

http://www.freebsd.org/~alfred/sockbuf4.diff

The patches are for FreeBSD-current as of this morning.

The patches also have a smarter (imo) version of sbcompress()
that will attempt to copy less data when it can.  I apologize
in advance for the gratuitous style changes, but I needed to
make the code a bit more readable.

Any feedback would be much appreciated, as I don't have a
high-delay, high-bandwidth network to test with.

I'm also pretty sure I'm not subscribed to this list, so please
don't neglect to cc me.

thanks,
-Alfred Perlstein - [bright@rush.net|alfred@freebsd.org]
Wintelcom systems administrator and programmer
   - http://www.wintelcom.net/ [bright@wintelcom.net]