Subject: TCP ACK convoying....
To: Jason Thorpe <thorpej@nas.nasa.gov>
From: Jonathan Stone <jonathan@DSG.Stanford.EDU>
List: tech-net
Date: 04/07/1998 01:01:19
I think maybe it's time to split the two threads here.
This message is purely about the TCP ACK convoying; let's leave the
in_maxmtu to its own thing.
Someone asked me in private email why I wasn't saying how to reproduce
this, so maybe I need to be clearer.
The four-line shell command sequence I posted before, which shows the
lo0-MTU bug, *also* reproduces the TCP packet-convoying performance
``bug'' over the local-loopback interface.
At least, it does for me, on three different ioasic-based DECstations,
running -current as of about a week ago. That's using the supplied
arguments to ttcp, built from our package system, to send a large
transfer over the local-loopback interface. And also on my own lab
machines, but those are a weird special case.
I've taken another trace on a machine set up on an isolated network,
doing nothing else except the sending and receiving ttcp and the
tcpdump. I've put the trace on ftp.netbsd.org, in
pub/NetBSD/arch/pmax/TTCP-TRACE. Since it's large and not really
relevant to pmaxes, it'll only stay for a couple of days.
The trace posted shows the exponential opening of TCP's window very
nicely. But the exponential window-open is so clear only because all
the data packets go out in a burst, followed by all the ACK packets.
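
One way to put a number on that burstiness is to collapse the trace
into runs of consecutive data and ACK packets. The following is only a
sketch: the 'D'/'A' labels are hypothetical and made up for
illustration, not taken from the posted trace, and run_lengths is a
helper name invented here. Labelling each tcpdump line as data or ACK
is left out.

```python
from itertools import groupby


def run_lengths(kinds):
    """Collapse a sequence of packet kinds into (kind, run_length) pairs."""
    return [(kind, sum(1 for _ in group)) for kind, group in groupby(kinds)]


# Hypothetical label sequences ('D' = data segment, 'A' = ACK).
# An interleaved flow: an ACK shows up after every couple of segments.
interleaved = list("DDADDADDA")
# A convoyed flow: each burst of data is followed by a burst of ACKs,
# and the data bursts double in size as the window opens.
convoyed = list("DDA" + "DDDDAA" + "DDDDDDDDAAAA")

print(run_lengths(interleaved))
# [('D', 2), ('A', 1), ('D', 2), ('A', 1), ('D', 2), ('A', 1)]
print(run_lengths(convoyed))
# [('D', 2), ('A', 1), ('D', 4), ('A', 2), ('D', 8), ('A', 4)]
```

Long, growing runs of 'D' followed by long runs of 'A' would be the
convoying signature; short alternating runs would be the interleaving
one would normally expect.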
After a brief skim of how tcp_input is enqueueing ACKs in the
header-prediction code, this looks rather odd. (All these packets
should be header-prediction hits, handled in tcp_input, which
definitely seems to be emitting ACKs as the packets come in.)
I've also seen the same effect sending from a 100Mbit ethernet (an
i386 running my own 1.2G/1.3_BETA tree, plus bcopy tuning, plus
interrupt-handling tuning, plus copyin()/copyout() tuning, plus
interrupt-handler tuning). And I've seen the same effect with two
different generations of research-prototype high-speed NICs.
Both the NICs were capable of sustaining speeds faster than the CPU
and TCP stack could copy and checksum data to feed them. It seems to
be throughput-related: if the CPU has more than enough grunt to keep
the media happy, it doesn't show up.
My current best guess, from a couple of days ago, is that the TCP code
actually *is* correct -- it certainly looks like it is -- but that the
real problem is starvation inside the chipset-level driver.
Specifically, that the receive-side processing at the hardware level
is taking precedence over the send-side: one packet arrives, gets handed
off to TCP, and as more receive-side packets arrive, the device-level
output handler never gets to pull packets off the if_output queue and
start sending them.
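
A toy model of that guess, as a sketch only. It is not driver code:
tap_order, rx_ring and send_queue are names invented here, every
received packet is assumed to produce exactly one ACK, and the whole
burst is assumed to be sitting in the receive ring already. It just
shows what a tap at the driver level would see under strict
receive-side priority versus a driver that drains its output queue
whenever there is something on it.

```python
from collections import deque


def tap_order(n_packets, rx_priority):
    """Return the order in which a driver-level tap sees packets.

    'D' is a received data packet handed up to TCP; 'A' is an ACK
    pulled off the output queue and sent. One unit of work per loop
    iteration.
    """
    rx_ring = deque(range(n_packets))  # assumed: burst already in the ring
    send_queue = deque()               # stands in for the if_output queue
    seen = []
    while rx_ring or send_queue:
        if rx_ring and (rx_priority or not send_queue):
            # Receive side runs: TCP sees the packet and queues an ACK.
            rx_ring.popleft()
            seen.append("D")
            send_queue.append("ACK")
        else:
            # Output handler finally gets to run and sends one ACK.
            send_queue.popleft()
            seen.append("A")
    return "".join(seen)


print(tap_order(4, rx_priority=True))   # DDDDAAAA
print(tap_order(4, rx_priority=False))  # DADADADA
```

With receive-side priority the ACKs are generated on time but only
leave (and only get tapped) after the receive ring drains, which is
the burst-then-ACKs pattern in the trace.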
Since that's where the bpf level wiretapping is done, it fits the
observed symptoms. I don't yet see how it explains the local-loopback
effect, though, since in that case there's really only one queue.
I've thought about this ACK-convoying hard enough, and long enough --
going back years -- that I'm starting to think it's worth adding
to Vern's I-D. Anyone who goes to IETFs got a better handle on how to
do that than just emailing Vern Paxson? I trust by now I've done
enough to establish clear precedence at identifying this bug? ;)