Subject: TCP_NODELAY and full links (was Re: sup problems?)
To: NetBSD-current Discussion List <current-users@netbsd.org>
From: Sean Doran <smd@ebone.net>
List: current-users
Date: 09/29/1999 01:59:41
woods@most.weird.com (Greg A. Woods) writes:

> BTW, because of use of TCP_NODELAY in SSH, rsync over SSH makes such
> effective use of the maximum bandwidth of a narrow bandwidth pipe that
> it is literally impossible to get a TCP handshake to complete
> successfully without first stopping or pausing the rsync and/or ssh
> process!  That would suggest to me that even alone rsync makes very good
> use of all available bandwidth

No, it suggests that something is leading the TCP
transmitter to occupy far too many buffers at the
bottleneck, such that it is interfering with other
traffic.  This is very bad, TCP-unfriendly behaviour.

I doubt that Nagle being disabled is the problem; check the
segment sizes with tcpdump -- I am willing to bet that
they are all MSS-sized.  The ssh 1.2.27 code turns on
TCP_NODELAY only when the session is interactive, which
should not be the case here.  However, you may be a victim
of too small an MSS (the default Internet MSS is simply
unrealistically small), which has much the same effect as
disabling Nagle.

There should only ever be (bandwidth * delay) worth of
data in flight between the transmitter and the receiver;
this maximizes throughput on quiescent links.  Maintaining
a persistent queue anywhere is simply wasteful of
resources at best, and can contribute to bad congestion
(as in your case, where you are unable to get other work
done), or to congestion collapse, where the TCP sender's
own throughput suffers along with everyone else's.
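For concreteness, the arithmetic looks like this (the 128
kbit/s link speed and 250 ms round-trip time below are
made-up example figures, not measurements of your path):

    /*
     * Bandwidth-delay product sketch.  The link speed and RTT
     * are assumed example values, not measurements.
     */
    #include <stdio.h>

    int
    main(void)
    {
        double bandwidth_bps = 128000.0; /* 128 kbit/s bottleneck */
        double rtt_s = 0.250;            /* 250 ms round trip */
        double bdp_bytes = bandwidth_bps * rtt_s / 8.0;

        printf("bandwidth-delay product: %.0f bytes\n", bdp_bytes);
        printf("about %.1f 536-byte segments belong in flight\n",
            bdp_bytes / 536.0);
        return 0;
    }

That works out to about 4000 bytes, or seven or eight
small segments, in flight; anything much beyond that just
sits in the bottleneck router's queue, adding delay
without adding throughput.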

In short, the transmitter is sending at a rate a little
higher than is necessary to keep the link full; this
steals buffers away at the bottleneck that could otherwise
be used to absorb transient bursts of traffic, such as the
start-up of a new TCP bulk transfer (e.g. clicking on a
link in a web page or starting an ssh session).

See Van Jacobson's beautiful and informative slide set at
ftp://ftp.ee.lbl.gov/talks/vj-nanog-red.ps.gz
(particularly slide 11) and also
http://www.psc.edu/networking/tcp_friendly.html for
Jamshid Mahdavi and Sally Floyd's excellent page dealing
with how TCP ought to behave in the real world.  There are
many papers linked from there that deal with the rather
ugly and very common problem you are experiencing.

Part of the problem is that the control law at the
bottleneck (presumably a small router) is a simple FIFO
queue, but a good part of the problem is aggressive window
inflation by the sender.

TCP_NODELAY turns off the Nagle algorithm, which prevents
the transmission of small packets by holding off on
transmitting until a full segment is available, or until a
(short) timer has expired.  In practice, this means that
when a "write(tcpsock, buf, 1);" system call is made with
Nagle disabled, a TCP segment with exactly one data byte
is immediately transmitted.  If ssh/scp/whatever is doing
small writes, you get small packets.  If Nagle is on,
several back-to-back small "write(tcpsock, buf, 1);" calls
will result in a single, much larger packet being
transmitted.
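For reference, this is about all TCP_NODELAY amounts to at
the socket level; a minimal sketch, with error handling
and the surrounding connection setup omitted:

    /*
     * Sketch: turning the Nagle algorithm off (and back on) for
     * a connected TCP socket.  "tcpsock" is assumed to be an
     * already connected descriptor, as in the write() example
     * above.
     */
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>

    static int
    set_nodelay(int tcpsock, int on)
    {
        /* on = 1: Nagle off, each small write goes out at once.
         * on = 0: Nagle on, small writes are coalesced into
         *         larger segments (up to the MSS). */
        return setsockopt(tcpsock, IPPROTO_TCP, TCP_NODELAY,
            &on, sizeof(on));
    }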

The "much larger" is limited by the MSS available to the
connection.  If the MSS is small, you can still run into
some problems.
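If you want to see what segment size a connection is
actually using, something along these lines should work (a
sketch, assuming the 4.4BSD TCP_MAXSEG socket option is
available on your stack):

    /*
     * Sketch: reading the segment size a connected socket is
     * using, via the TCP_MAXSEG option (assumed to be present,
     * as on 4.4BSD-derived stacks).
     */
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <stdio.h>

    static void
    print_mss(int tcpsock)
    {
        int mss;
        socklen_t len = sizeof(mss);

        if (getsockopt(tcpsock, IPPROTO_TCP, TCP_MAXSEG,
            &mss, &len) == 0)
            printf("current MSS: %d bytes\n", mss);
    }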

TCP transmission is clocked by ACKs: the more packets that
are sent, the more ACKs return, and therefore the quicker
the sender's congestion window grows.  Likewise, a large
number of small packets provides more opportunities for
fast retransmit and fast recovery, which tends to keep the
transmission rate high.
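Very roughly, and glossing over a great deal, the per-ACK
window growth looks like this (a simplified illustration,
not any particular stack's code):

    /*
     * Simplified sketch of per-ACK congestion window growth:
     * below ssthresh the window grows by one MSS per ACK (slow
     * start); above it, by roughly one MSS per round trip
     * (congestion avoidance).  cwnd starts at one MSS.
     */
    static unsigned long
    cwnd_on_ack(unsigned long cwnd, unsigned long ssthresh,
        unsigned long mss)
    {
        if (cwnd < ssthresh)
            cwnd += mss;                /* slow start */
        else
            cwnd += mss * mss / cwnd;   /* congestion avoidance */
        return cwnd;
    }

Every returning ACK is an opportunity to grow the window,
so a stream of many small segments keeps pumping the
window up.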

The problem is that the transmitting rate ends up *too*
high, largely because TCP is so conservative in its timing
strategy (in large part a legacy of historical clock/timer
resolution limitations) that it does not notice the
growing delay until it actually loses packets.
Congestion-avoiding transmitters should maintain a better
srtt estimate (using RFC 1323 RTTM timestamps) and reduce
their sending rate when the srtt begins to grow, since
increasing measured RTTs are indicative of incipient
congestion.
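The smoothing itself is just the classic estimator from
Van Jacobson's 1988 congestion-avoidance work; with RFC
1323 timestamps you can feed it one measurement per
segment rather than one per window.  A rough sketch:

    /*
     * Sketch of the classic smoothed-RTT estimator (gains of
     * 1/8 and 1/4).  A sender watching srtt climb could back
     * off before it actually loses packets.
     */
    #include <math.h>

    static void
    update_srtt(double *srtt, double *rttvar, double measured_rtt)
    {
        double err = measured_rtt - *srtt;

        *srtt += err / 8.0;                     /* gain 1/8 */
        *rttvar += (fabs(err) - *rttvar) / 4.0; /* gain 1/4 */
    }

When that estimate starts climbing, the queue at the
bottleneck is already building.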

Furthermore, for a bulk transfer, where interactive
response time is not an issue, disabling the Nagle delay
is counterproductive: the smaller your packets are, the
less data is transferred application-to-application in a
given amount of time, because of IP and TCP header
overhead.
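Putting numbers on that overhead (assuming 20 bytes of IP
header plus 20 bytes of TCP header and no options):

    /*
     * Sketch: fraction of bytes on the wire that are actual
     * data, assuming 40 bytes of IP+TCP headers per segment.
     */
    #include <stdio.h>

    int
    main(void)
    {
        int headers = 20 + 20;
        int payloads[] = { 1, 536, 1460 };
        int i;

        for (i = 0; i < 3; i++)
            printf("%4d data bytes -> %4.1f%% payload\n",
                payloads[i],
                100.0 * payloads[i] / (payloads[i] + headers));
        return 0;
    }

A one-byte segment is about 2% data on the wire; a
536-byte segment is about 93%; a 1460-byte segment is
about 97%.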

In other words, if TCP_NODELAY does indeed lead to the
transmission of smaller-than-MSS packets for a bulk
transfer over a path with a sizeable round-trip delay and
a bottleneck, it is simply wrong.

An MSS that is small (and the default Internet MSS of 536
bytes is small) is also inefficient, for the same reasons.

TCP_NODELAY and small segments are not your friends here.
You have not only filled your link, but also all your
bottleneck buffers, and this is hurting your ability to
establish or sustain other TCP flows across that link.

Personally, I don't think you should be happy about that at all.

        Sean.