Subject: Re: Networking question MTU on non-local nets
To: None <port-macppc@netbsd.org>
From: der Mouse <mouse@Rodents.Montreal.QC.CA>
List: port-macppc
Date: 06/14/2003 19:48:05
> The problem I have apparently requires four things to cause _badness_:

> 	4. One of the routers' (or endpoint's) fragmentation/re-assembly
> 	is busted.

Well, if one end doesn't conform to the spec (which calls for all
endpoints to be capable of reassembly), I'm surprised this is the only
form of brokenness you're seeing.  But....

> 22:50:05.864796 192.168.0.38.49815 > mercy.icompute.com.http: S 1026952240:1026952240(0) win 32768 <mss 1460,nop,wscale 0,nop,nop,timestamp 707979176 0> (DF)
> 22:50:06.123198 mercy.icompute.com.http > 192.168.0.38.49815: S 2847775046:2847775046(0) ack 1026952241 win 16384 <mss 1414,nop,wscale 0,nop,nop,timestamp 17981721 707979176>
> 22:50:06.123306 192.168.0.38.49815 > mercy.icompute.com.http: . ack 1 win 33648 <nop,nop,timestamp 707979176 17981721> (DF)
> 22:50:06.127597 192.168.0.38.49815 > mercy.icompute.com.http: P 1:225(224) ack 1 win 33648 <nop,nop,timestamp 707979176 17981721> (DF)
> 22:50:06.397694 mercy.icompute.com.http > 192.168.0.38.49815: . 1:993(992) ack 225 win 17520 <nop,nop,timestamp 17981722 707979176> (frag 8006:1024@0+)
> 22:50:06.397704 mercy.icompute.com > 192.168.0.38: (frag 8006:422@1024)
> 22:50:06.444745 192.168.0.38.49815 > mercy.icompute.com.http: . ack 1415 win 33648 <nop,nop,timestamp 707979177 17981722> (DF)
> 22:50:06.705762 mercy.icompute.com.http > 192.168.0.38.49815: . 1415:2407(992) ack 225 win 17520 <nop,nop,timestamp 17981722 707979176> (frag 8007:1024@0+)
> 22:50:06.718277 mercy.icompute.com.http > 192.168.0.38.49815: . 2829:3821(992) ack 225 win 17520 <nop,nop,timestamp 17981722 707979176> (frag 8008:1024@0+)
> 22:50:07.761468 mercy.icompute.com.http > 192.168.0.38.49815: . 1415:2407(992) ack 225 win 17520 <nop,nop,timestamp 17981724 707979176> (frag 8009:1024@0+)
> 22:50:10.761691 mercy.icompute.com.http > 192.168.0.38.49815: . 1415:2407(992) ack 225 win 17520 <nop,nop,timestamp 17981730 707979176> (frag 8010:1024@0+)

> To my ignorant eyes, it looked like some of the fragmented packets
> were not getting through.

Yes, it looks that way to me too.  But that doesn't mean reassembly is
broken; if the fragments don't arrive, they can't be reassembled no
matter what.
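
As an aside, tcpdump's fragment notation is "frag ID:SIZE@OFFSET",
with a trailing "+" meaning more fragments follow.  A throwaway
parser, purely to make the trace above easier to read (the helper is
mine, not part of any tool):

```python
# Parse tcpdump's "frag ID:SIZE@OFFSET[+]" fragment notation.
# (The notation is tcpdump's; this helper is just for illustration.)
import re

def parse_frag(s):
    m = re.fullmatch(r"frag (\d+):(\d+)@(\d+)(\+?)", s)
    ident, size, offset, more = m.groups()
    return {"id": int(ident), "size": int(size),
            "offset": int(offset), "more": more == "+"}

# First fragment of datagram 8006: 1024 payload bytes at offset 0,
# more-fragments flag set:
print(parse_frag("frag 8006:1024@0+"))
# Final fragment: 422 bytes at offset 1024, no "+", so it's the last:
print(parse_frag("frag 8006:422@1024"))
```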

The first suspect that comes to mind is the NAT box.  (There must be
some kind of NAT box involved, as 192.168.0.38 is not a globally
routable address.)  A lot of such boxes are poorly implemented, and it
wouldn't surprise me at all if yours mishandled fragments: only the
first fragment of a datagram carries the TCP header, so second and
later fragments have no port numbers, and a box that doesn't keep
fragment state can't match them to an established connection and may
simply drop them.  (It's obviously not quite that simple, though, as
one of the second fragments got through, at 06.397704.)
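
(For the record, the "not globally routable" claim is easy to check;
Python's stdlib knows the RFC 1918 private ranges:)

```python
# 192.168.0.38 falls in 192.168.0.0/16, one of the RFC 1918 private
# blocks, so it cannot appear as a source address on the public
# Internet without NAT in between.
import ipaddress

addr = ipaddress.ip_address("192.168.0.38")
print(addr.is_private)   # -> True
print(addr.is_global)    # -> False
```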

The second suspect that comes to mind is defective networking hardware.
Some hardware has trouble receiving back-to-back packets.  Since your
local network is (presumably) faster than your uplink, you never get
back-to-back packets normally.  But it seems that when fragmentation
occurs, you get them; check the timestamps for the packet that did get
both fragments through:

> 22:50:06.397694 [...] (frag 8006:1024@0+)
> 22:50:06.397704 [...] (frag 8006:422@1024)

...interesting.  What sort of medium are you using?  Reception times
are normally the time the end of the packet arrives, plus factors such
as interrupt latency.  The second packet is roughly 3500 bits long.  At
10Mbit, that's about 350µs; at 100Mbit, 35µs.  But those packets are
only 10µs apart; either you're using gig-e or some such, or they both
got handled during the same interrupt, and either way, they were
probably back-to-back - and, maybe, fragmented on the client's side of
the slow link, odd as that seems.
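
Spelling the arithmetic out (the ~440-byte frame size is my guess for
the 422-byte fragment plus IP and link-layer headers; the conclusion
doesn't hinge on the exact overhead):

```python
# Serialization time of the second fragment at various link speeds.
# 422 payload bytes plus headers - call it ~440 bytes on the wire
# (an assumption; the exact framing overhead doesn't change the point).
frame_bits = 440 * 8          # roughly 3500 bits

for mbit in (10, 100, 1000):
    us = frame_bits / mbit    # bits / (Mbit/s) = microseconds
    print(f"{mbit:5d} Mbit/s: ~{us:.0f} us on the wire")
```

The observed 10µs inter-packet gap is well under even the 100Mbit
figure, hence the gig-e-or-same-interrupt inference above.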

If you can, it might be worth trying to step down to a lower speed;
that may space things out more.

If there is speed-adapting going on (eg, a 10/100 hub with one device
at 10 and one at 100 - though as I remark, it appears that the network
is faster than 100), that can also cause trouble with back-to-back
packets, as they don't always have enough buffering.

> It's pretty clear that my webserver - up till today - was sending out
> packets in the 1400+ range to just about everyone, and I've had very
> few complaints, so obviously you can get away with it _most_ of the
> time.

Yes, for suitable values of "get away with".  Most people are not
capable of identifying the problem when faced with a black hole such as
I outlined, and based on my lack of success in getting things fixed,
even people who are competent to identify it would likely give up on
bothering to complain after a little while.

> Worst case, turning on PMTU-D should not *hurt* anything on my
> server, as it should still work fine for all those connections that
> can handle the larger packets.  The only place I get in trouble are
> those places that were broken before **and** have broken routers
> and/or packet filtering that causes PMTU-D to fail to function.

Yes, as long as _you_ don't have any filtering that defeats PMTU-D, it
should make the test above work.  You'll be swapping failure in the
face of one brokenness (dropped fragments) for failure in the face of a
different brokenness (dropping outbound ICMP); whether this is a win
depends on which you care about more.  My guess, for what it's worth,
is that it would be a win, that losing fragments is commoner than
filtering outbound need-to-frag ICMPs.
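
For reference, the bookkeeping PMTU-D does amounts to this (a sketch;
the 1454-byte next-hop MTU is my inference from the server's
advertised mss 1414, not something visible in the trace):

```python
# PMTU-D: send everything with DF set, and shrink the path MTU
# estimate whenever a router returns ICMP "fragmentation needed"
# (type 3, code 4) carrying its next-hop MTU.  Usual IPv4 rule:
# segment size is capped at MTU - 40 (20-byte IP header + 20-byte
# TCP header, options not counted) and at the peer's advertised MSS.

def mss_for_mtu(mtu):
    return mtu - 40

path_mtu = 1500                  # start at the local link MTU
print(mss_for_mtu(path_mtu))     # -> 1460, as in the client's SYN

# Suppose a need-to-frag ICMP arrives citing next-hop MTU 1454
# (hypothetical value, chosen to match the server's mss 1414):
path_mtu = min(path_mtu, 1454)
print(mss_for_mtu(path_mtu))     # -> 1414
```

If that ICMP is filtered somewhere along the way, the min() never
happens and the DF-marked packets just vanish - the black hole.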

A more detailed description of the client-side network might help:
where is the low-MTU link, what hardware is on each end of it, where is
the NAT being done, what speeds are the various pieces running at, that
sort of thing.

/~\ The ASCII				der Mouse
\ / Ribbon Campaign
 X  Against HTML	       mouse@rodents.montreal.qc.ca
/ \ Email!	     7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B