Subject: Re: TCP ACK convoying....
To: Bill Sommerfeld <sommerfeld@orchard.arlington.ma.us>
From: Jonathan Stone <jonathan@DSG.Stanford.EDU>
List: tech-net
Date: 04/07/1998 16:59:44
hi Bill,

Thanks very much for the feedback, much appreciated!

To me, it does sound very similar to the hypothesis I described
earlier, and I noticed the netisr() in if_loop.c very late last
night. I do think you (or we ;) are onto something.

But even just over loopback, it's not quite as simple as slow-start;
the trace I put up for FTP is clearly in exponential mode. And the
tcp_input() code is calling tcp_output() to send the ACKs.  So, yes,
once the convoys form, even with one data/one ack, the non-reordering
properties of loopback and exponential window-opens should guarantee
that the convoys continue to grow.

This is a longstanding boo-boo, I've seen it on Ultrix, so it must
date back to at least 4.2BSD ;)



I have also seen ACK convoying over several real NICs, for example when
sending from a 100Mbit NetBSD host (in my lab, nonstandard 1.3_ALPHA/1.2G
plus tuning hacks) to a 10Mbit Ethernet host (a DECstation).

I can make the traces available, but it's a bit hard for someone to
set up the same kernel and replicate it.

I've taken a bpf trace on the receiver, which also shows ACK
convoying.  So it's not just the loopback interface, and it's not just
ACK compression.  (I'm sorry if I didn't make that clear.)

I've also seen ACK convoying with NetBSD 1.x on a pair of DECWRL T3
cards (the same boards used for the Sequoia 2000 project) and with
some research-prototype NICs in my lab.  The ACK generation in TCP does
look correct, it clearly DTRT for people who aren't so close to the
bleeding edge of what their hardware can do, and so I am not focusing
on the tcp_input() ACK-generation code itself.

On the research-prototype boards, I have set up an external loopback,
and taken bpf wiretaps on one machine, capturing both send and receive
side.  I've examined the resulting tcpdump traces using a modified
perl script from the MosquitoNet project, which expects both send and
receive wiretaps and DTRT with the trace.
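
As a rough illustration of what to look for in such a trace (this is
not the MosquitoNet script, just a minimal Python sketch), one can
reduce the trace to a sequence of labels in capture order, "D" for a
data segment and "A" for a pure ACK, and measure how long the runs of
each kind are.  The labels and the function names run_lengths and
longest_ack_run are made up for this example, and extracting the
labels from tcpdump output is left out.

```python
from itertools import groupby

def run_lengths(kinds):
    """Collapse a sequence of labels into (label, run length) pairs.

    kinds is any iterable of hashable labels in capture order; here
    "D" stands for a data segment and "A" for a pure ACK (both labels
    are assumptions of this sketch, not tcpdump output).
    """
    return [(kind, sum(1 for _ in group)) for kind, group in groupby(kinds)]

def longest_ack_run(kinds):
    """Length of the longest run of back-to-back ACKs, 0 if none."""
    return max((n for kind, n in run_lengths(kinds) if kind == "A"), default=0)

# Strictly alternating data/ack: every run has length 1.
print(run_lengths("DADADA"))
# A burst of data followed by a burst of ACKs: runs of length 4.
print(run_lengths("DDDDAAAA"))
```

Under this (assumed) definition, an alternating trace gives ACK runs
of length 1, while a convoyed trace shows long ACK runs following long
data runs.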

I still see ACK convoying, and I see it with a point-to-point
test as well.

So I'm not convinced it's as simple as a single queue and
non-reordering.

I think the hypothesis I posted last night is the same basic idea as
your explanation for the loopback interface; I just didn't understand
then how to apply it to loopback.  Thanks for waking me up!

But since, maybe, you didn't see the similarity, here it is again:

My idea is that ACK convoying on real interfaces is due to something
like ``receiver livelock'', where the incoming packets arrive just
slightly faster than TCP can process them.  In that world, once a
burst of even two packets arrives, then either the connection has a
``standing burst'' (a convoy) of packets coming off the interface
hardware, or it has no input packets at all.  (Here, I'm assuming
one connection.)

Once the burst of input packets forms, the driver takes input
interrupts and pulls packets off the device and drops them onto the
input queue.  So, maybe the driver and the network stack are giving
precedence to handling input packets, and either

  a) not giving the receiving TCP enough CPU to compute and
     enqueue ACKs for the packets already on the interface input
     queue or higher in the stack; or
  b) our TCP is getting enough cycles to compute the ACKs and put
     them on the interface output queue, but the driver is getting
     input packet interrupts fast enough to stop it from servicing
     the output queue.  So, no ACKs go out until the entire input
     convoy arrives; or
  c) some mixture of the above.
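
Case (b) can be sketched as a toy model in Python.  Everything here
is an assumption made for illustration, not a measurement of any real
driver: a single CPU, input handling with strict priority over
output-queue servicing, fixed per-packet costs, and one ACK per data
packet.  The name simulate and its parameters are hypothetical.

```python
def simulate(n_packets, arrival_gap, rx_cost, tx_cost):
    """Return the times at which ACKs leave the output queue.

    Packets arrive every arrival_gap time units starting at time 0.
    Handling one input packet (driver plus TCP, including queueing
    its ACK) costs rx_cost; transmitting one queued ACK costs tx_cost.
    Input always wins: an ACK is sent only when no packet is waiting.
    """
    arrivals = [i * arrival_gap for i in range(n_packets)]
    now = 0
    next_in = 0        # index of the next packet not yet handled
    pending_acks = 0   # ACKs sitting on the output queue
    ack_times = []
    while len(ack_times) < n_packets:
        if next_in < n_packets and arrivals[next_in] <= now:
            # A packet is waiting: input takes precedence.
            now += rx_cost
            next_in += 1
            pending_acks += 1
        elif pending_acks > 0:
            # No input waiting: service the output queue.
            now += tx_cost
            ack_times.append(now)
            pending_acks -= 1
        else:
            # Idle: skip ahead to the next arrival.
            now = arrivals[next_in]
    return ack_times

# Arrivals slower than processing: ACKs are spaced like the data.
print(simulate(5, arrival_gap=10, rx_cost=4, tx_cost=2))  # [6, 16, 26, 36, 46]
# Arrivals slightly faster than processing: ACKs leave as one convoy.
print(simulate(5, arrival_gap=3, rx_cost=4, tx_cost=2))   # [22, 24, 26, 28, 30]
```

In the second run the input queue is never empty while the burst is
arriving, so no ACK goes out until the last data packet has been
handled, and then all five leave back to back.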

Viewed that way, it starts to look rather like receive-livelock, or
even like the inverse of the performance optimizations suggested by
Trevor Blackwell (SIGCOMM 96 again).

>I don't see how this generalizes to traffic between multiple systems,
>where you have a different queue in each direction and two processes
>chewing through the packets instead of one.  A test case which
>demonstrates it *in the non-loopback case* would be helpful..

Do you want a trace showing this, or a mechanism where you can
duplicate a trace yourself?  I have oodles of such traces...

For me, running the posted `do-ttcp' script (which sets up nice large
windows so you can see the effect) on two machines, sending from
100Mbit to 10Mbit, works.

Sending from a 133MHz Pentium with 100Mbit Ethernet (either DEC 2114x
or even 3Com 3c595) to a DECstation with 10Mbit Ethernet, using a
Cisco Catalyst 5000 as the 10-to-100 converter, seems to do the trick.
Certainly sending from the 100Mbit machines to the 10Mbit machine I
have at home, on the other side of an old AGS+, shows ACK convoying
utterly repeatably.