Subject: SOLVED! The cause of puzzling TCP (eg. WHOIS) connection failures with some InterNIC.net hosts
To: North America Network Operators Group <nanog@merit.edu>
From: Greg A. Woods <woods@most.weird.com>
List: tech-net
Date: 11/20/1998 16:25:11
[[ NOTE:  this message is cross-posted to NANOG and tech-net, as well as
Cc'ed to Mark Kosters (because I don't know if Mark reads either list).
Please reply either directly to me, or to *one* of those lists as
appropriate (it's probably not relevant to discuss on NANOG now that the
problem has been identified unless it's a problem with some particular
piece of equipment, in which case it would be good to identify it so
others can fix similar problems). ]]

I've discovered the cause of those problems with TCP connections to/from
some InterNIC.net hosts (and some other hosts, one of which was trying
to send me e-mail and thus necessitated that I debug it in more detail).

Now that I know the cause I can say that this problem is usually
indicative of a firewall with a non-compliant TCP/IP implementation,
though it may also indicate an unwise firewall filtering policy.

The problem has to do with the failure of a host to send smaller
packets on demand (i.e. when a router on the path can't forward a packet
marked DF and sends back an ICMP "needs frag" notification).  This may
be because the ICMP packet never gets through (perhaps someone who
didn't understand TCP/IP and ICMP and everything else related
implemented a filter on all "abnormal" ICMP packets); or it may be
because the host receiving it doesn't understand the ICMP "needs frag"
message (i.e. it sets DF as path MTU discovery requires but doesn't
implement the rest of it, or have I got that backwards?).
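
For reference, here is a minimal sketch (in Python) of what a sender
doing path MTU discovery along the lines of RFC 1191 is expected to do
with such a message: read the next-hop MTU out of the ICMP header and
shrink its TCP segment size to fit.  The function names are made up for
illustration and are not taken from any real stack, and the sketch
assumes 20-byte IP and TCP headers with no options.

```python
import struct

ICMP_DEST_UNREACH = 3   # ICMP type: destination unreachable
ICMP_FRAG_NEEDED = 4    # ICMP code: fragmentation needed and DF set


def next_hop_mtu(icmp: bytes):
    """Return the next-hop MTU from an ICMP "needs frag" message, else None."""
    if len(icmp) < 8:
        return None
    # RFC 1191 layout: type, code, checksum, unused, next-hop MTU.
    icmp_type, code, _checksum, _unused, mtu = struct.unpack("!BBHHH", icmp[:8])
    if icmp_type != ICMP_DEST_UNREACH or code != ICMP_FRAG_NEEDED:
        return None
    return mtu


def mss_for_mtu(mtu: int, ip_header: int = 20, tcp_header: int = 20) -> int:
    """Largest TCP payload that fits in one unfragmented packet of this MTU."""
    return mtu - ip_header - tcp_header


# Hand-built header standing in for a "need to frag (mtu 1006)" message.
# The checksum is left as zero because this sketch does not verify it.
sample = struct.pack("!BBHHH", ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED, 0, 0, 1006)
print(mss_for_mtu(next_hop_mtu(sample)))  # prints 966
```

In other words, a sender that honoured a "needs frag" message reporting
an MTU of 1006 would retransmit its data in segments of at most 966
bytes rather than 1460.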

No matter what the problem really is, I'm sure a *lot* of people would
be much happier if this problem were fixed, specifically for the WHOIS
service (though I've also had trouble receiving HTTP).  I got quite
a few replies about similar experiences when I first posted about this
on NANOG recently.

Here's a sample trace collected from the PPP router upstream which shows
the outgoing ICMP packets and the incoming TCP retransmissions,
un-fragmented, even after the first request to fragment:

15:02:56.097980 204.92.254.2.4721 > 198.41.0.6.43: S 1660910424:1660910424(0) win 16384 <mss 1460,nop,wscale 0,nop,nop,timestamp 5148233 0> (ttl 62, id 25948)
15:02:56.420273 198.41.0.6.43 > 204.92.254.2.4721: S 1062510833:1062510833(0) ack 1660910425 win 8760 <mss 1460> (DF) (ttl 245, id 4189)
15:02:56.674783 204.92.254.2.4721 > 198.41.0.6.43: . ack 1 win 17520 (ttl 62, id 25951)
15:02:56.677143 204.92.254.2.4721 > 198.41.0.6.43: P 1:6(5) ack 1 win 17520 (ttl 62, id 25952)
15:02:57.175854 198.41.0.6.43 > 204.92.254.2.4721: . ack 6 win 8760 (DF) (ttl 245, id 4190)
15:02:59.393169 198.41.0.6.43 > 204.92.254.2.4721: P 1:4(3) ack 6 win 8760 (DF) (ttl 245, id 4191)
15:02:59.532326 204.92.254.2.4721 > 198.41.0.6.43: . ack 4 win 17517 (ttl 62, id 25994)
15:03:00.326761 198.41.0.6.43 > 204.92.254.2.4721: . 4:1464(1460) ack 6 win 8760 (DF) (ttl 245, id 4192)
15:03:00.327688 204.29.161.41 > 198.41.0.6: icmp: 204.92.254.2 unreachable - need to frag (mtu 1006) (DF) (ttl 255, id 19390)
15:03:00.420416 198.41.0.6.43 > 204.92.254.2.4721: . 1464:2914(1450) ack 6 win 8760 (DF) (ttl 245, id 4193)
15:03:00.421157 204.29.161.41 > 198.41.0.6: icmp: 204.92.254.2 unreachable - need to frag (mtu 1006) (DF) (ttl 255, id 19391)
15:03:03.381245 198.41.0.6.43 > 204.92.254.2.4721: . 4:1464(1460) ack 6 win 8760 (DF) (ttl 245, id 4194)
15:03:03.382120 204.29.161.41 > 198.41.0.6: icmp: 204.92.254.2 unreachable - need to frag (mtu 1006) (DF) (ttl 255, id 19392)
15:03:10.619116 198.41.0.6.43 > 204.92.254.2.4721: . 4:1464(1460) ack 6 win 8760 (DF) (ttl 245, id 4195)
15:03:10.620110 204.29.161.41 > 198.41.0.6: icmp: 204.92.254.2 unreachable - need to frag (mtu 1006) (DF) (ttl 255, id 19411)
15:03:24.974732 198.41.0.6.43 > 204.92.254.2.4721: . 4:1464(1460) ack 6 win 8760 (DF) (ttl 245, id 4196)
15:03:24.975626 204.29.161.41 > 198.41.0.6: icmp: 204.92.254.2 unreachable - need to frag (mtu 1006) (DF) (ttl 255, id 19413)
15:03:53.941690 198.41.0.6.43 > 204.92.254.2.4721: . 4:1464(1460) ack 6 win 8760 (DF) (ttl 245, id 4197)
15:03:53.942656 204.29.161.41 > 198.41.0.6: icmp: 204.92.254.2 unreachable - need to frag (mtu 1006) (DF) (ttl 255, id 19418)
15:04:50.256764 198.41.0.6.43 > 204.92.254.2.4721: . 4:1464(1460) ack 6 win 8760 (DF) (ttl 245, id 52333)
15:04:50.257959 204.29.161.41 > 198.41.0.6: icmp: 204.92.254.2 unreachable - need to frag (mtu 1006) (DF) (ttl 255, id 19425)
15:05:46.509834 198.41.0.6.43 > 204.92.254.2.4721: . 4:1464(1460) ack 6 win 8760 (DF) (ttl 245, id 43047)
15:05:46.510716 204.29.161.41 > 198.41.0.6: icmp: 204.92.254.2 unreachable - need to frag (mtu 1006) (DF) (ttl 255, id 19433)
15:06:29.615496 198.41.0.6.43 > 204.92.254.2.4721: R 4:4(0) ack 6 win 0 (ttl 55, id 23874)
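
To pick the offending packets out of a longer trace, something like the
following rough Python sketch could be used.  It assumes tcpdump output
in exactly the format shown above and 40 bytes of IP plus TCP headers
per segment; the regular expressions and the function name are my own
and would need adjusting for other tcpdump versions or options.

```python
import re

SEGMENT = re.compile(r"\d+:\d+\((\d+)\)")             # e.g. "4:1464(1460)"
NEED_FRAG = re.compile(r"need to frag \(mtu (\d+)\)")  # ICMP "needs frag" line


def oversized_segments(lines, headers=40):
    """Yield (line, packet_size, mtu) for DF segments that are larger
    than the most recently reported MTU."""
    mtu = None
    for line in lines:
        m = NEED_FRAG.search(line)
        if m:
            mtu = int(m.group(1))   # remember what the router asked for
            continue
        seg = SEGMENT.search(line)
        if seg and mtu is not None and "(DF)" in line:
            size = int(seg.group(1)) + headers
            if size > mtu:
                yield line, size, mtu


trace = [
    "15:03:00.326761 198.41.0.6.43 > 204.92.254.2.4721: . 4:1464(1460) ack 6 win 8760 (DF) (ttl 245, id 4192)",
    "15:03:00.327688 204.29.161.41 > 198.41.0.6: icmp: 204.92.254.2 unreachable - need to frag (mtu 1006) (DF) (ttl 255, id 19390)",
    "15:03:00.420416 198.41.0.6.43 > 204.92.254.2.4721: . 1464:2914(1450) ack 6 win 8760 (DF) (ttl 245, id 4193)",
    "15:03:03.381245 198.41.0.6.43 > 204.92.254.2.4721: . 4:1464(1460) ack 6 win 8760 (DF) (ttl 245, id 4194)",
]

for _line, size, mtu in oversized_segments(trace):
    print(f"{size}-byte DF packet sent after need-frag for mtu {mtu}")
```

The first segment isn't flagged because no "needs frag" had been sent
yet when it was seen; the two that follow are.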

Note that large ICMP echo packets get through correctly, seemingly
because they aren't marked DF and so NetBSD fragments them on the way
through (I suppose a "needs frag" packet could be sent in this case too,
but that does seem a little too intertwined to be reliable).

Here's the trace of two big (-s 1400) packet pings from the router's POV:

15:17:21.092624 204.92.254.2 > 198.41.0.6: icmp: echo request (ttl 253, id 41845)
15:17:21.366494 198.41.0.6 > 204.92.254.2: icmp: echo reply (ttl 246, id 25176)
15:17:21.978679 204.92.254.2 > 198.41.0.6: icmp: echo request (ttl 253, id 41855)
15:17:22.227824 198.41.0.6 > 204.92.254.2: icmp: echo reply (ttl 246, id 25351)

And here's what I see on my end of the link corresponding to the above:

15:17:17.466591 204.92.254.2 > 198.41.0.6: icmp: echo request (ttl 255, id 41845)
15:17:18.467986 204.92.254.2 > 198.41.0.6: icmp: echo request (ttl 255, id 41855)
15:17:18.487006 198.41.0.6 > 204.92.254.2: icmp: echo reply (frag 25176:984@0+) (ttl 244)
15:17:18.489940 198.41.0.6 > 204.92.254.2: (frag 25176:24@984) (ttl 244)
15:17:19.251136 198.41.0.6 > 204.92.254.2: icmp: echo reply (frag 25351:984@0+) (ttl 244)
15:17:19.263880 198.41.0.6 > 204.92.254.2: (frag 25351:24@984) (ttl 244)
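
The fragment sizes above (984 bytes at offset 0, then 24 bytes at offset
984) are what a router would produce for a 1008-byte IP payload on a
link with the 1006-byte MTU reported in the earlier trace.  Here is a
small Python sketch of the arithmetic, assuming a 20-byte IP header with
no options (the function name is just for illustration):

```python
def fragment_sizes(payload: int, mtu: int, ip_header: int = 20):
    """Split an IP payload into (length, offset, more_fragments) tuples."""
    room = mtu - ip_header
    if payload <= room:
        return [(payload, 0, False)]        # fits, no fragmentation needed
    # Every fragment but the last must carry a multiple of 8 bytes,
    # because the IP fragment offset field counts in 8-byte units.
    max_data = room // 8 * 8
    if max_data <= 0:
        raise ValueError("MTU too small to carry any data")
    frags = []
    offset = 0
    while offset < payload:
        length = min(max_data, payload - offset)
        frags.append((length, offset, offset + length < payload))
        offset += length
    return frags


print(fragment_sizes(1008, 1006))
# prints [(984, 0, True), (24, 984, False)]
```

The 1006-byte MTU leaves room for 986 bytes of data, which rounds down
to 984, hence the "984@0+" and "24@984" in the trace.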

Perhaps routers (i.e. NetBSD when it's routing, in this case) should
also fragment TCP packets on the way through if there are "too many"
retransmissions of over-sized packets (one's too many for me, but I
guess on high-latency links there might be two or three in the pipe --
perhaps a timer would help adjust when to do local fragmenting).
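
Purely as a thought experiment, the idea might be sketched in Python
like this.  It is not something NetBSD does, the class and its
parameters are hypothetical, the timer isn't modelled at all, and
fragmenting a packet marked DF goes against what the sender asked for,
so take it only as an illustration of the bookkeeping involved.

```python
from collections import defaultdict


class OversizeRetransmitTracker:
    """Count sightings of the same over-sized segment on each flow."""

    def __init__(self, max_retransmits: int = 1):
        self.max_retransmits = max_retransmits
        self.seen = defaultdict(int)   # (flow, seq) -> times seen over-sized

    def should_fragment(self, flow, seq, packet_len, mtu) -> bool:
        """Return True once a segment has been retransmitted over-sized
        more than max_retransmits times."""
        if packet_len <= mtu:
            return False               # fits, forward it as usual
        self.seen[(flow, seq)] += 1
        # The first sighting is the original transmission; every later
        # one is a retransmission that ignored the "needs frag" reply.
        return self.seen[(flow, seq)] - 1 > self.max_retransmits


tracker = OversizeRetransmitTracker(max_retransmits=1)
flow = ("198.41.0.6", 43, "204.92.254.2", 4721)
for attempt in range(3):
    print(attempt, tracker.should_fragment(flow, 4, 1500, 1006))
```

With max_retransmits set to 1 the router would answer the original
packet and its first retransmission with "needs frag" as usual, and only
start fragmenting locally on the second retransmission.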

-- 
							Greg A. Woods

+1 416 218-0098      VE3TCP      <gwoods@acm.org>      <robohack!woods>
Planix, Inc. <woods@planix.com>; Secrets of the Weird <woods@weird.com>