Subject: TCP backoff behavior in NetBSD 1.0 vs. SunOS 4.1.3_U1
To: <>
From: Greg Earle <earle@isolar.Tujunga.CA.US>
List: current-users
Date: 11/29/1994 04:15:11
Here's something weird I just noticed.

I have NetBSD/SPARC 1.0 running on a Sun SPARCstation IPC at work (its TCP/IP
is based on the 4.4BSD-lite version by now, I think).  I just FTP'd a 1.2 Mb
file from it down to my SPARCstation 1 clone running SunOS 4.1.3 at home, over
a PPP link running DP 2.3 pl2 between the home machine and a Sun-4/370 at work.

Normally I'll get 1.5 Kb/sec transfers of gzip'ped files over this PPP link,
but this time the throughput was terrible - 0.9 Kb/sec.  There were several
times when the transfer came to a complete halt, for many - some times in the
tens - of seconds.

A "tcpdump" of the last "pregnant pause" showed something interesting:

...
03:16:56.257706 netbsd4me.ftp-data > isolar.3090: P 177031:178491(1460) ack 1 win 17520 [tos 0x8] (ttl 63, id 60576)
03:16:56.440477 isolar.3090 > netbsd4me.ftp-data: . ack 178491 win 24576 (ttl 60, id 24254)
03:16:57.154914 netbsd4me.ftp-data > isolar.3090: . 178491:179951(1460) ack 1 win 17520 [tos 0x8] (ttl 63, id 60577)
03:16:57.240599 isolar.3090 > netbsd4me.ftp-data: . ack 179951 win 24576 (ttl 60, id 24255)
03:16:58.875522 netbsd4me.ftp-data > isolar.3090: . 179951:181411(1460) ack 1 win 17520 [tos 0x8] (ttl 63, id 60579)
03:16:59.775012 netbsd4me.ftp-data > isolar.3090: . 181411:182871(1460) ack 1 win 17520 [tos 0x8] (ttl 63, id 60580)
03:17:00.645779 netbsd4me.ftp-data > isolar.3090: . 182871:184331(1460) ack 1 win 17520 [tos 0x8] (ttl 63, id 60581)
03:17:01.516060 netbsd4me.ftp-data > isolar.3090: P 184331:185791(1460) ack 1 win 17520 [tos 0x8] (ttl 63, id 60582)
03:17:02.415639 netbsd4me.ftp-data > isolar.3090: . 185791:187251(1460) ack 1 win 17520 [tos 0x8] (ttl 63, id 60583)
03:17:03.285962 netbsd4me.ftp-data > isolar.3090: P 187251:188711(1460) ack 1 win 17520 [tos 0x8] (ttl 63, id 60584)
03:17:04.184818 netbsd4me.ftp-data > isolar.3090: . 188711:190171(1460) ack 1 win 17520 [tos 0x8] (ttl 63, id 60586)
03:17:05.055984 netbsd4me.ftp-data > isolar.3090: . 190171:191631(1460) ack 1 win 17520 [tos 0x8] (ttl 63, id 60587)
03:17:05.925245 netbsd4me.ftp-data > isolar.3090: P 191631:193091(1460) ack 1 win 17520 [tos 0x8] (ttl 63, id 60588)
03:17:06.826240 netbsd4me.ftp-data > isolar.3090: . 193091:194551(1460) ack 1 win 17520 [tos 0x8] (ttl 63, id 60589)
03:17:07.665512 netbsd4me.ftp-data > isolar.3090: P 194551:196011(1460) ack 1 win 17520 [tos 0x8] (ttl 63, id 60590)
03:17:13.574226 netbsd4me.ftp-data > isolar.3090: . 179951:181411(1460) ack 1 win 17520 [tos 0x8] (ttl 63, id 60594)
03:17:13.620497 isolar.3090 > netbsd4me.ftp-data: . ack 181411 win 24576 (ttl 60, id 24260)
03:17:14.716071 netbsd4me.ftp-data > isolar.3090: . 181411:182871(1460) ack 1 win 17520 [tos 0x8] (ttl 63, id 60595)
03:17:14.820433 isolar.3090 > netbsd4me.ftp-data: . ack 182871 win 24576 (ttl 60, id 24261)
...

("netbsd4me" is the remote NetBSD/SPARC 1.0 machine.)

You'll notice that at ~ 3:16:59, the remote end sends the first packet that
doesn't get an ACK back from my local side.  It then proceeds to fill the TCP
window over the next 9 seconds with 12 packets; then there is a "pregnant
pause" (still no reply from my side) of nearly 6 more seconds of "dead air"
before the remote end re-sends the first packet in the sequence; note that
according to the "id" field it jumped from "id 60590" to "id 60594".  The
first thing I would like to know is "What happened to id 60591, 60592, and 3?"

The next thing is, the "id" field on the local side's return ACKs jumped from
"id 24255" up to "id 24260".  What happened to id's 24256 through 24259?  Did
they get used by other apps?

Finally, can anything be tuned on either end to help avoid this "pregnant
pause" condition?  I just transferred a file from another SunOS 4.1.3_U1
system and it only got in this state once; the remote end only sent 6 packets
and then delayed only about 0.71 seconds (normal between-packet times for
the 6 packets was ~0.39 seconds, i.e. not much more than the modem turnaround
times) before re-sending the original packet in the stream:

03:50:11.100608 isolar.3164 > tehachapis.ftp-data: . ack 77097 win 24576 (ttl 60, id 25027)
03:50:11.756893 tehachapis.ftp-data > isolar.3164: . 77097:78557(1460) ack 1 win 24576 (ttl 59, id 63566)
03:50:12.138162 tehachapis.ftp-data > isolar.3164: . 78557:80017(1460) ack 1 win 24576 (ttl 59, id 63567)
03:50:12.507091 tehachapis.ftp-data > isolar.3164: . 80017:81477(1460) ack 1 win 24576 (ttl 59, id 63568)
03:50:12.896614 tehachapis.ftp-data > isolar.3164: . 81477:82937(1460) ack 1 win 24576 (ttl 59, id 63569)
03:50:13.294590 tehachapis.ftp-data > isolar.3164: . 82937:84397(1460) ack 1 win 24576 (ttl 59, id 63570)
03:50:13.651138 tehachapis.ftp-data > isolar.3164: . 84397:85857(1460) ack 1 win 24576 (ttl 59, id 63571)
03:50:14.364483 tehachapis.ftp-data > isolar.3164: . 77097:78557(1460) ack 1 win 24576 (ttl 59, id 63576)
03:50:14.500472 isolar.3164 > tehachapis.ftp-data: . ack 78557 win 24576 (ttl 60, id 25030)
03:50:15.176559 tehachapis.ftp-data > isolar.3164: . 78557:80017(1460) ack 1 win 24576 (ttl 59, id 63577)

("tehachapis" is the remote SunOS 4.1.3_U1 machine.)

In short, it looks like when there is a hiccup at the local end, the NetBSD
machine "fills the pipe" with more data that it eventually has to retransmit,
and when it finally realizes that it has to re-send, it waited a REALLY long
time to do so ... whereas the hoary old SunOS machine fills the pipe with
only half as many packets, and when it stops, it backs off for a very short
time before re-sending.  When doing TCP over a slow modem PPP link, this
difference in behavior shows up as a painfully poor throughput connection.
(Notice how essentially no packets got through for ~ 15 seconds, since all
 of them had to be re-sent.)

I know that part of the problem is down in the driver level, like in the
SunOS zs_async driver.  Consider this "pregnant pause" with a much smaller
advertised window:

...
04:07:04.353561 elroy.2005 > isolar.uucp: . 328089:329549(1460) ack 2605 win 24576 (ttl 59, id 12891)
04:07:04.425405 isolar.uucp > elroy.2005: . ack 329549 win 4088 (ttl 60, id 25685)
04:07:06.055073 elroy.2005 > isolar.uucp: . 329549:331009(1460) ack 2605 win 24576 (ttl 59, id 12931)
04:07:22.404890 elroy.2005 > isolar.uucp: . 329549:331009(1460) ack 2605 win 24576 (ttl 59, id 14016)
04:07:22.434260 isolar.uucp > elroy.2005: . ack 331009 win 4096 (ttl 60, id 25690)

On UUCP-over-TCP, my local side advertises only a 4K window, so the other side
doesn't fill the pipe with packets that end up needing to be retransmitted;
on the other hand, notice how there was still a *16 second* pause between the
first packet and the re-transmission!!!  Ouch!  (During which another series
of 4 "id" numbers - 25686 through 25689 - went AWOL.)

I don't get this.  I would have assumed that the NetBSD/SPARC 1.0 machine
would be much smarter, considering as how the SunOS 4.1.3_U1 TCP is
practically from the Paleolithic Era at this point.  If I'm missing something
obvious, please clunk me over the head with an ICMP Dumbshit Poster or
something ...

Any ideas, anyone?

TIA,

	- Greg