Subject: incorrect need-frag ICMPs?
To: tech-net@netbsd.org
From: der Mouse <mouse@Rodents.Montreal.QC.CA>
List: tech-net
Date: 06/28/2000 02:09:29
Under what circumstances can NetBSD generate need-frag ICMP
unreachables incorrectly (ie, when the packet does fit the interface)?

I appear to have found a way to cause this to happen, at least somewhat
repeatably.

Specifically, I have the following setup:

+---------+    +-------+    +------------+    +-------+
| sparkle |    | stone |    | troglodyte |    | omega |
+---------+    +-------+    +------------+    +-------+
     |          |     |           |               |
  ---+----------+--   |        ---+--------+------+---
                      |                    |
                      +-- (the internet) --+

The link between sparkle and stone is my home LAN, which is 10Mb - an
AUI multiport repeater and a 10baseT hub.  The link from stone to the
net is an ADSL line.  Troglodyte and omega are at one of my workplaces,
where they are linked by an Ethernet-layer switch to a routing module
which gateways packets to/from the global internet.

Stone and troglodyte run custom code to encapsulate, tunnel, and
de-encapsulate packets (my own code, because I needed some things none
of the existing encapsulations looked ready to do).  The
pseudo-interface used for this on stone is configured with MTU 1300:

encap0: flags=11<UP,POINTOPOINT> mtu 1300
	inet 216.46.5.9 --> 10.0.0.1 netmask 0xffffffff

Everything normally works fine.  I have path MTU discovery enabled on
sparkle, and it quickly discovers that external hosts have a path MTU
of 1300 (or, occasionally, less).  And normally, a connection sending
data from sparkle to omega settles down to happily pumping out 1248
bytes of data per packet (1300, minus 52 header bytes).
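
(For what it's worth, I'm assuming those 52 bytes are the usual 20-byte
IP header, 20-byte TCP header, and 12 bytes of TCP option space for
timestamps: 1300 - 20 - 20 - 12 = 1248 bytes of payload.)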

But sometimes, under circumstances I don't quite understand, stone gets
into a state where sparkle sends out a perfectly normal-looking packet,
1300 bytes long (1248 bytes of TCP payload), and stone sends back a
need-frag unreachable giving MTU 1300.  Sparkle resends the same packet
(presumably because the reported MTU of 1300 is no smaller than what it
is already using, so there is nothing to shrink), and the resend falls
into a black hole (omega never sees it).  Eventually sparkle's
retransmission timer goes off and the pattern repeats.

Sometimes the bursts are just as sketched above, three packets (data,
ICMP, data); sometimes they involve five packets (data, ICMP, data,
ICMP, data).  The packets within a burst come in quick succession,
with long periods of inactivity (retransmission timeouts, presumably)
between bursts.
Eventually sparkle gives up trying to get a response out of omega and
declares the connection broken.

I've sniffed at stone's interface, and the incoming packets are indeed
1300 bytes long; I cannot understand why they provoke need-frag ICMPs.
Both sparkle and stone are SPARCs running 1.4T.  I've pored over
ip_input.c (ip_forward in particular), trying to see what on earth
could cause this, to no avail.  It appears to have something to do with
the connection sitting idle; I've seen it happen only after
connectivity outages, or when there is a comparatively long time
between the three-way handshake and any data being sent.  I have a
full tcpdump of this latter case - a tcpdump -w capture file - which I
can send to anyone interested, and I can of course run tcpdump -r on
it first if anyone prefers the decoded output.
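
For anyone following along without source to hand, the path I've been
staring at looks roughly like this.  This is paraphrased from memory
of the 4.4BSD-derived code, not quoted from the 1.4T sources, so the
details may be slightly off:

	/* ip_forward(), after ip_output() has failed: */
	switch (error) {
	/* ... */
	case EMSGSIZE:
		type = ICMP_UNREACH;
		code = ICMP_UNREACH_NEEDFRAG;
		if (ipforward_rt.ro_rt)
			destifp = ipforward_rt.ro_rt->rt_ifp;
		break;
	/* ... */
	}
	icmp_error(mcopy, type, code, dest, destifp);

	/* and in icmp_error(), for a need-frag unreachable: */
	if (type == ICMP_UNREACH && code == ICMP_UNREACH_NEEDFRAG &&
	    destifp)
		icp->icmp_nextmtu = htons(destifp->if_mtu);

If I'm reading that right, the MTU reported in the need-frag comes
from the outgoing interface's if_mtu, independently of whatever value
the length check in ip_output() actually used.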

I've checked the driver for the encap interface, and it can't ever
generate EMSGSIZE from its output routine - and packets with 1248-byte
payloads normally work fine anyway.

I did add a log() call to ip_output(), at the point where EMSGSIZE is
generated when DF is set, and it seems that the mtu it's comparing the
packet length against is 1024, not 1300.  The only place I can see
that value possibly coming from is encap1, which does have MTU 1024 -
but that interface is not configured up (and in fact has never been
touched since boot), and "netstat -rn" does not show any routes
through it, so I can't see how ip_output could have taken it into its
pointy little brain to go anywhere near it.  Nor is it simply the
minimum across all interfaces; there's also sl1, with MTU 296, but it
picks 1024 anyway.  This also means it's returning the wrong MTU in
the need-frag message: it's comparing against 1024 but reporting 1300.
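
For completeness, the check my log() is sitting next to looks roughly
like this (again paraphrased, variable names approximate; the log()
line is just the sort of thing I dropped in, not stock code):

	/* ip_output(), once mtu has been chosen for this route: */
	if ((u_int16_t)ip->ip_len <= mtu) {
		/* fits: hand the packet to the interface */
		error = (*ifp->if_output)(ifp, m,
		    (struct sockaddr *)dst, ro->ro_rt);
		goto done;
	}

	/* too big: fragment, unless DF forbids it */
	if (ip->ip_off & IP_DF) {
		log(LOG_DEBUG, "ip_output: DF set, len %d > mtu %lu\n",
		    (int)ip->ip_len, (u_long)mtu);	/* my hack */
		error = EMSGSIZE;
		ipstat.ips_cantfrag++;
		goto bad;
	}

and it is the mtu in that comparison that is coming out as 1024.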

The source tree I'm working with has ip_output.c version 1.67; I have
no private patches to it (except for the log() call I mentioned above,
which is strictly temporary debugging).

Any clues?  Any further tests I can run that might help?

					der Mouse

			       mouse@rodents.montreal.qc.ca
		     7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B