Subject: Re: NetBSD TCP strangeness (was: problems using nbcvs)
To: John Klos <john@ziaspace.com>
From: Jason R Thorpe <thorpej@wasabisystems.com>
List: tech-net
Date: 01/28/2003 17:10:23
On Tue, Jan 28, 2003 at 05:41:16PM -0500, John Klos wrote:

 > From behind IP NAT using IP Filter (tried NetBSD 1.5.2, 1.5.3, 1.6, 1.6
 > release from two weeks ago, FreeBSD 4.6.2), all Mac OS X machines get
 > anywhere from .5 k to 10 k/sec from my server. Note that all of the NATs
 > tested were not PPPoE or anything that requires a reduced MTU.
 > 
 > From behind all of those NATs, Linux, Windows, Mac OS classic, AmigaDOS,
 > and NetBSD all get expected speeds:
 > gaia: {4} ftp -4 http://www.sixgirls.org/~liz/cry.mp3
 > Requesting http://www.sixgirls.org/~liz/cry.mp3
 > 100% |*************************************| 4228 KB 273.25 KB/s 00:00 ETA
 > 
 > I don't have any OS X machines that are not behind NAT, but if anyone else
 > can test this, I'd like to know if the problem is specifically due to IP
 > Filter's NAT and how it works with OS X. Also, the OS X machines don't
 > have any problems with most other servers.

I suspect you are experiencing a bug in the Mac OS X TCP.  A while ago,
a friend noticed that he was getting horrible performance from his OS X
system to his NetBSD-based file server, which was located on another
subnet.

After doing some digging, we determined (by looking at tcpdumps) that
OS X has the "stretch ACK bug", something they inherited from FreeBSD
(and something that we fixed in NetBSD .. years ago).

Summary of the bug is:

	Due to a disconnect between what the sender and receiver think
	is a maximum-sized segment (specifically, the receiver thinks
	the maximum-sized segment is larger than what the sender actually
	uses as the maximum-sized segment), the receiver erroneously delays
	sending ACKs to the server, causing the server to delay sending more
	data (because the CWND opens up slowly).

More lengthy explanation:

	A TCP receiver using delayed ACK (which nearly every TCP does) is
	supposed to send an ACK for every 2 maximum-sized segments it
	receives.  Some TCP implementations use a simple byte counter to
	make this determination.  A timer is used to force an ACK in the
	event the second packet never comes.

	Consider what happens if the receiver thinks the sender's maximum-
	sized segment is 1400, yet the sender is actually using a maximum-
	sized segment of 512.  The sender will send 2 segments, but since
	that adds up to only 1024, the receiver continues to wait for more
	data, but the sender is waiting for an ACK in order to open up
	the CWND.

	Eventually, the receiver's delayed ACK timer fires, and it sends
	the ACK, thus allowing the sender to send more data.
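To make the example above concrete, here is a minimal sketch of a
byte-counting receiver.  It is a toy model, not code from any real
stack: the function name is made up, and it assumes the receiver's
threshold is two times what it believes the sender's maximum-sized
segment to be.

```python
def acks_with_byte_counter(segment_sizes, assumed_mss):
    """Toy model of a receiver that decides when to ACK by counting bytes.

    Returns (acked_at, pending): the indices of the segments that
    triggered an immediate ACK, and the bytes still waiting for the
    delayed ACK timer once the input runs out.
    """
    # Assumption: "2 maximum-sized segments" is measured in bytes, using
    # the receiver's own idea of the sender's maximum-sized segment.
    threshold = 2 * assumed_mss
    pending = 0
    acked_at = []
    for i, size in enumerate(segment_sizes):
        pending += size
        if pending >= threshold:
            acked_at.append(i)
            pending = 0
    return acked_at, pending


# The sender transmits two 512-byte segments; the receiver assumes 1400.
print(acks_with_byte_counter([512, 512], assumed_mss=1400))
# The same two segments when both sides agree on 512.
print(acks_with_byte_counter([512, 512], assumed_mss=512))
```

In the first call no segment triggers an ACK and 1024 bytes are left
pending, so only the timer can release the sender.  In the second call
the second segment triggers the ACK immediately.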

This is, without a doubt, a problem with the receiver.  The receiver has
no way of knowing for certain what the sender can use for a maximum-sized
segment on transmit (this is especially true in asymmetric routing
situations).

There are a couple of solutions, both of which must be implemented on the
receiver:

	* Be clever about determining what the sender considers to
	  be a maximum-sized segment, and adjust the delayed ACK
	  threshold accordingly.

	* Simply ACK every two packets.

The latter is what NetBSD chose to do; it's simple and effective.
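The two approaches can be sketched in the same toy style.  Neither
function is taken from tcp_input.c; the names are hypothetical, and the
"clever" variant simply assumes the largest segment seen so far is a
good estimate of the sender's maximum-sized segment, which is only one
of several possible heuristics.

```python
def acks_every_two_packets(segment_sizes):
    """ACK on every second packet, regardless of how large the packets are."""
    acked_at = []
    pending_packets = 0
    for i, _size in enumerate(segment_sizes):
        pending_packets += 1
        if pending_packets >= 2:
            acked_at.append(i)
            pending_packets = 0
    return acked_at


def acks_with_observed_mss(segment_sizes):
    """Byte counter whose threshold follows the largest segment seen so far.

    Hypothetical heuristic: treat the largest segment received as the
    sender's real maximum-sized segment, and ACK after two of those.
    """
    acked_at = []
    pending = 0
    observed_mss = 0
    for i, size in enumerate(segment_sizes):
        observed_mss = max(observed_mss, size)
        pending += size
        if pending >= 2 * observed_mss:
            acked_at.append(i)
            pending = 0
    return acked_at


# Four 512-byte segments from a sender whose real limit is 512.
print(acks_every_two_packets([512, 512, 512, 512]))
print(acks_with_observed_mss([512, 512, 512, 512]))
```

In this toy model both variants ACK on the second and fourth segments
without waiting for the timer.  The packet counter gets there with no
estimation at all, which is the appeal of the simpler fix.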

This bug has been documented for quite some time now, and NetBSD fixed
it thusly (this is from the tcp_input.c CVS log):

----------------------------
revision 1.37
date: 1997/12/11 06:33:29;  author: thorpej;  state: Exp;  lines: +19 -9
Fix the "stretch ACK violation" bug documented in internet draft
draft-ietf-tcpimpl-prob-02.txt.  Also, fix another bug in the header
prediction case where an ACK would not be sent when it should be.
----------------------------

I know the person who was bitten by this reported it to a clueful friend
at Apple, but it was not fixed for the Jaguar release.

Of course, Apple *should* have used NetBSD's TCP, but they had people
beating the FreeBSD drum there, and now all of us Mac users have the
pleasure of experiencing the consequences of that choice.

-- 
        -- Jason R. Thorpe <thorpej@wasabisystems.com>