Subject: TCP ACK convoying....
To: Jason Thorpe <thorpej@nas.nasa.gov>
From: Jonathan Stone <jonathan@DSG.Stanford.EDU>
List: tech-net
Date: 04/07/1998 01:01:19
I think maybe it's time to split the two threads here.

This message is purely about the TCP ACK convoying; let's leave the
in_maxmtu issue to its own thread.

Someone asked me in private email why I wasn't saying how to reproduce
this, so maybe I need to be clearer.

The four-line shell command sequence I posted before, which shows the
lo0-MTU bug, *also* reproduces the TCP packet-convoying performance
``bug'' over the local-loopback interface.

At least, it does for me, on three different ioasic-based DECstations
running -current as of about a week ago.  That's using the supplied
arguments to ttcp, built from our package system, to send a large
transfer over the local-loopback interface.  It also shows up on my
own lab machines, but those are a weird special case.

I've taken another trace on a machine set up on an isolated network,
doing nothing else except the sending and receiving ttcp and the
tcpdump. I've put the trace on ftp.netbsd.org, in
pub/NetBSD/arch/pmax/TTCP-TRACE.  Since it's large and not really
relevant to pmaxes, it'll only stay for a couple of days.

The trace posted shows the exponential opening of TCP's window very
nicely.  But the exponential window-open is so clear only because all
the data packets go out in a burst, followed by all the ACK packets.
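
For calibration: the per-RTT doubling you can read off the trace is
just slow start doing its job.  Here's a toy user-level model of that
growth; the 1460-byte MSS and the ssthresh value are made up for
illustration, not taken from the trace:

#include <stdio.h>

/*
 * Toy model of slow start: the congestion window opens by one
 * MSS per ACK, i.e. it doubles every round trip until it hits
 * ssthresh.  MSS and ssthresh are illustrative values only.
 */
int
main(void)
{
	unsigned mss = 1460;
	unsigned cwnd = mss;		/* start at one segment */
	unsigned ssthresh = 65535;
	int rtt;

	for (rtt = 0; cwnd < ssthresh; rtt++) {
		printf("rtt %d: cwnd = %u segments (%u bytes)\n",
		    rtt, cwnd / mss, cwnd);
		/* each of the cwnd/mss ACKs opens the window one MSS */
		cwnd += (cwnd / mss) * mss;
	}
	return 0;
}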

After a brief skim of how tcp_input is enqueueing ACKs in the
header-prediction code, this looks rather odd.  (All these packets
should be header-prediction hits, handled in tcp_input, which
definitely seems to be emitting ACKs as the packets come in.)
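
For anyone who hasn't read that code recently, here's a condensed
user-level paraphrase of the pure-data fast path, written from memory
of the 4.4BSD-Lite sources -- a sketch of the control flow, not the
actual NetBSD code:

#include <stdio.h>

#define TH_FIN	0x01
#define TH_SYN	0x02
#define TH_RST	0x04
#define TH_ACK	0x10
#define TH_URG	0x20

struct tcpcb {			/* only the fields the fast path tests */
	unsigned rcv_nxt;	/* next receive sequence expected */
	unsigned snd_una;	/* oldest unacknowledged send seq */
	unsigned snd_nxt;	/* next send sequence number */
	unsigned snd_max;	/* highest send seq used so far */
	unsigned snd_wnd;	/* send window */
	int	 established;
};

/*
 * Pure-data half of the header-prediction test; the checks for
 * socket-buffer space and an empty reassembly queue are omitted.
 * Returns nonzero if the segment took the fast path.
 */
static int
fastpath(struct tcpcb *tp, int flags, unsigned seq, unsigned ack,
    unsigned win, unsigned len)
{
	if (tp->established &&
	    (flags & (TH_SYN|TH_FIN|TH_RST|TH_URG|TH_ACK)) == TH_ACK &&
	    seq == tp->rcv_nxt &&
	    win != 0 && win == tp->snd_wnd &&
	    tp->snd_nxt == tp->snd_max &&
	    len > 0 && ack == tp->snd_una) {
		/*
		 * In-sequence data: append to the socket buffer
		 * and arrange an ACK.  Per the trace this does
		 * happen as each segment arrives.
		 */
		tp->rcv_nxt += len;
		printf("fast path: ACK scheduled up to %u\n",
		    tp->rcv_nxt);
		return 1;
	}
	return 0;		/* fall through to the slow path */
}

int
main(void)
{
	struct tcpcb tp = { 1, 100, 100, 100, 8192, 1 };

	/* two back-to-back in-sequence 512-byte segments */
	fastpath(&tp, TH_ACK, 1, 100, 8192, 512);
	fastpath(&tp, TH_ACK, 513, 100, 8192, 512);
	return 0;
}

Which matches what I see: the ACKs get arranged per segment as it
arrives, so the convoying has to be happening further down.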

I've also seen the same effect sending from a 100Mbit ethernet (an
i386 running my own 1.2G/1.3_BETA tree, plus bcopy tuning, plus
interrupt-handler tuning, plus copyin()/copyout() tuning).  And I've
seen the same effect with two different generations of
research-prototype high-speed NICs.

Both the NICs were capable of sustaining speeds faster than the CPU
and TCP stack could copy and checksum data to feed them.  It seems to
be throughput-related: if the CPU has more than enough grunt to keep
the media happy, it doesn't show up.

My current best guess, from a couple of days ago, is that the TCP code
actually *is* correct -- it certainly looks like it is -- but that the
real problem is starvation inside the chipset-level driver.
Specifically, that the receive-side processing at the hardware level
is taking precedence over the send-side: one packet arrives, gets
handed off to TCP, and as more receive-side packets arrive, the
device-level output handler never gets to pull packets off the
interface output queue (if_snd) and start sending them.
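
To make that concrete, here's a toy model of the suspected service
order -- pure illustration of the hypothesis, not anything from a
real driver:

#include <stdio.h>

/*
 * Toy model of receive-side starvation: the interrupt handler
 * drains every pending receive before it looks at the output
 * queue, so the ACKs TCP queued sit on if_snd until input goes
 * idle.  The counts are arbitrary.
 */
int
main(void)
{
	int rx_pending = 8;	/* data segments already in the RX ring */
	int acks_queued = 0;	/* ACKs TCP has placed on if_snd */

	/* receive side: serviced until the RX ring is empty */
	while (rx_pending-- > 0) {
		acks_queued++;	/* tcp_input() generates one ACK each */
		printf("rx segment; %d ACKs waiting on if_snd\n",
		    acks_queued);
	}

	/* only now does the send side drain if_snd: the ACK convoy */
	while (acks_queued-- > 0)
		printf("tx ACK (convoyed)\n");
	return 0;
}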

Since the driver level is also where the bpf wiretapping is done, that
fits the observed symptoms.  I don't yet see how it explains the
local-loopback effect, though, since there's really only one queue
there.

I've thought about this ACK-convoying hard enough, and long enough --
going back years -- that I'm starting to think it's worth adding to
Vern's I-D.  Does anyone who goes to IETFs have a better handle on how
to do that than just emailing Vern Paxson?  I trust by now I've done
enough to establish clear precedence at identifying this bug? ;)