Subject: Re: Networking question MTU on non-local nets
To: der Mouse <mouse@Rodents.Montreal.QC.CA>
From: Donald Lee <MacPPC@caution.icompute.com>
List: port-macppc
Date: 06/14/2003 20:53:45
This is wandering seriously off-topic, but still very interesting.....

At 7:48 PM -0400 6/14/03, der Mouse wrote:
>[snip]
>> 22:50:05.864796 192.168.0.38.49815 > mercy.icompute.com.http: S 1026952240:1026952240(0) win 32768 <mss 1460,nop,wscale 0,nop,nop,timestamp 707979176 0> (DF)
>> 22:50:06.123198 mercy.icompute.com.http > 192.168.0.38.49815: S 2847775046:2847775046(0) ack 1026952241 win 16384 <mss 1414,nop,wscale 0,nop,nop,timestamp 17981721 707979176>
>> 22:50:06.123306 192.168.0.38.49815 > mercy.icompute.com.http: . ack 1 win 33648 <nop,nop,timestamp 707979176 17981721> (DF)
>> 22:50:06.127597 192.168.0.38.49815 > mercy.icompute.com.http: P 1:225(224) ack 1 win 33648 <nop,nop,timestamp 707979176 17981721> (DF)
>> 22:50:06.397694 mercy.icompute.com.http > 192.168.0.38.49815: . 1:993(992) ack 225 win 17520 <nop,nop,timestamp 17981722 707979176> (frag 8006:1024@0+)
>> 22:50:06.397704 mercy.icompute.com > 192.168.0.38: (frag 8006:422@1024)
>> 22:50:06.444745 192.168.0.38.49815 > mercy.icompute.com.http: . ack 1415 win 33648 <nop,nop,timestamp 707979177 17981722> (DF)
>> 22:50:06.705762 mercy.icompute.com.http > 192.168.0.38.49815: . 1415:2407(992) ack 225 win 17520 <nop,nop,timestamp 17981722 707979176> (frag 8007:1024@0+)
>> 22:50:06.718277 mercy.icompute.com.http > 192.168.0.38.49815: . 2829:3821(992) ack 225 win 17520 <nop,nop,timestamp 17981722 707979176> (frag 8008:1024@0+)
>> 22:50:07.761468 mercy.icompute.com.http > 192.168.0.38.49815: . 1415:2407(992) ack 225 win 17520 <nop,nop,timestamp 17981724 707979176> (frag 8009:1024@0+)
>> 22:50:10.761691 mercy.icompute.com.http > 192.168.0.38.49815: . 1415:2407(992) ack 225 win 17520 <nop,nop,timestamp 17981730 707979176> (frag 8010:1024@0+)
>
>> To my ignorant eyes, it looked like some of the fragmented packets
>> were not getting through.
>
>[...probably a NAT box....]

Yes.  In fact, we got a traceroute from this guy:

>>>>traceroute to www.qdea.com (209.46.8.67), 30 hops max, 40 byte packets
>>>>  1  192.168.0.1 (192.168.0.1)  0.795 ms  0.57 ms  0.527 ms
>>>>  2  * * *
>>>>  3  * * *
>>>>  4  * * *
>>>>  5  * * *
>>>>  6  * * *
>>>>  7  * * *
>>>>  8  * * *
>>>>  9  * * *
>>>>10  * * *
>>>>11  * * *
>>>>12  * * *
>>>>13  * * *
>>>>14  * * *
>>>>15  * * *
>>>>16  * * *
>>>>17  * * *
>>>>18  * * *
>>>>19  * * *
>>>>20  * * *
>>>>21  * * *
>>>>22  * * *
>>>>23  * * *
>>>>24  * * *
>>>>25  * * *
>>>>26  * * *
>>>>27  * * *
>>>>28  * * *
>>>>29  * * *
>>>>30  * * *

Useful, huh?

From my side:

[joy:~] donlee% traceroute pddf654.tkyoac00.ap.so-net.ne.jp
traceroute to pddf654.tkyoac00.ap.so-net.ne.jp (218.221.246.84), 30 hops max, 40 byte packets
 1  valor (209.46.8.65)  1.538 ms  1.175 ms  1.152 ms
 2  216.245.174.14 (216.245.174.14)  69.187 ms  32.323 ms  33.247 ms
 3  216.245.167.130 (216.245.167.130)  65.844 ms  33.076 ms  33.434 ms
 4  216.245.174.163 (216.245.174.163)  64.262 ms  33.551 ms  40.906 ms
 5  ag-core-engp1-gig4-0.agiliti.net (216.245.174.41)  57.162 ms  33.389 ms  33.424 ms
 6  apr1-serial4-1-0.minneapolis.cw.net (208.174.7.41)  34.047 ms  33.847 ms  34.225 ms
 7  acr1.minneapolis.cw.net (208.174.2.61)  34.831 ms  61.8 ms  34.439 ms
 8  agr3-loopback.chicago.cw.net (208.172.2.103)  42.359 ms  54.099 ms  43.143 ms
 9  acr1-loopback.chicago.cw.net (208.172.2.61)  43.121 ms  44.67 ms  42.988 ms
10  cable-and-wireless-peering.chicago.cw.net (208.172.3.234)  49.743 ms  53.608 ms  50.124 ms
11  pos6-0-2488m.cr1.chi1.gblx.net (208.49.59.205)  83.668 ms  49.378 ms  49.883 ms
12  so1-3-0-622m.cr2.nrt1.gblx.net (203.192.128.193)  195.154 ms  203.735 ms  195.364 ms
13  pos15-0-2488m.ar2.nrt1.gblx.net (203.192.128.134)  195.783 ms  195.814 ms  196.904 ms
14  203.192.131.234 (203.192.131.234)  195.316 ms  195.304 ms  195.406 ms
15  note-12gi0-0-0.net.so-net.ne.jp (211.10.62.78)  195.725 ms  196.325 ms  198.96 ms
16  61.211.63.247 (61.211.63.247)  197.955 ms  196.689 ms  196.707 ms
17  pddf654.tkyoac00.ap.so-net.ne.jp (218.221.246.84)  219.913 ms  220.926 ms  221.33 ms

I see just a couple of opportunities to mess this up. ;->

The fact that the traceroute works (from my side) is a good sign if I
am hoping to get PMTU-D working.

>[...back-to-back....]
>
>> 22:50:06.397694 [...] (frag 8006:1024@0+)
>> 22:50:06.397704 [...] (frag 8006:422@1024)
>
>...interesting.  What sort of medium are you using?  Reception times
>are normally the time the end of the packet arrives, plus factors such
>as interrupt latency.  The second packet is roughly 3500 bits long.  At
>10Mbit, that's about 350µs; at 100Mbit, 35µs.  But those packets are
>only 10µs apart; either you're using gig-e or some such, or they both
>got handled during the same interrupt, and either way, they were
>probably back-to-back - and, maybe, fragmented on the client's side of
>the slow link, odd as that seems.

Oooo.... Good catch.
This is very interesting.  I had not noticed the timings, and
10us is very impressive.
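
der Mouse's serialization arithmetic checks out; it can be reproduced with a
quick sketch (plain arithmetic, nothing NetBSD-specific; the 3500-bit figure
is his estimate for the second fragment including headers):

```python
# Rough wire time for a frame: bits divided by the link rate.
def wire_time_us(bits, link_bps):
    """Microseconds the frame occupies on the wire at the given link rate."""
    return bits * 1e6 / link_bps

# The second fragment is roughly 3500 bits (422 bytes of IP payload plus
# IP and link-layer headers), per the estimate in the quoted message.
print(wire_time_us(3500, 10e6))    # 350.0 us at 10 Mbit/s
print(wire_time_us(3500, 100e6))   # 35.0 us at 100 Mbit/s
```

Either rate is well above the observed 10 µs spacing, which is his point.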

I seriously doubt that the fragmentation is happening on this
continent.  I'm betting that it's happening somewhere
in Japan.

With all the routers between here and there, no 10 µs inter-packet gap
could survive all those hops.
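
The fragment sizes in the trace are consistent with a small-MTU link
somewhere on that side.  A sketch of the IPv4 fragmentation arithmetic
(the 1044-byte link MTU here is an assumption, chosen because it reproduces
the 1024@0+ / 422@1024 split in the trace):

```python
# IPv4 fragmentation sketch: every fragment but the last must carry a
# multiple of 8 bytes of data, and offsets are byte offsets into the
# original datagram's payload.
IP_HDR = 20  # bytes, assuming no IP options

def fragment(payload_len, link_mtu):
    """Return (data_len, offset, more_fragments) tuples for one datagram."""
    chunk = (link_mtu - IP_HDR) // 8 * 8   # largest 8-byte-aligned data size
    frags, off = [], 0
    while payload_len - off > link_mtu - IP_HDR:
        frags.append((chunk, off, True))
        off += chunk
    frags.append((payload_len - off, off, False))
    return frags

# The 1414-byte MSS segment plus a 32-byte TCP header (timestamps enabled)
# gives 1446 bytes of IP payload, matching the trace's two fragments:
print(fragment(1446, 1044))   # [(1024, 0, True), (422, 1024, False)]
```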

>[snip]
>> Worst case, turning on PMTU-D should not *hurt* anything on my
>> server, as it should still work fine for all those connections that
>> can handle the larger packets.  The only place I get in trouble are
>> those places that were broken before **and** have broken routers
>> and/or packet filtering that causes PMTU-D to fail to function.
>
>Yes, as long as _you_ don't have any filtering that defeats PMTU-D, it
>should make the test above work.  You'll be swapping failure in the
>face of one brokenness (dropped fragments) for failure in the face of a
>different brokenness (dropping outbound ICMP); whether this is a win
>depends on which you care about more.  My guess, for what it's worth,
>is that it would be a win, that losing fragments is commoner than
>filtering outbound need-to-frag ICMPs.

I sure hope not - but it's pretty hard to tell; once you get more than a
couple of hops upstream, all bets are off.
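
The tradeoff der Mouse describes can be sketched as a toy simulation (no
real sockets; the link MTUs are illustrative, not measured from this path):

```python
# Toy model of Path MTU Discovery: the sender marks packets DF, and a router
# whose next link is too small replies with an ICMP "fragmentation needed"
# carrying that link's MTU.  If something on the path filters that ICMP,
# the connection black-holes instead of adapting.
def pmtud(path_mtus, size, icmp_filtered=False, max_tries=5):
    """Return the packet size that finally got through, or None (black hole)."""
    for _ in range(max_tries):
        bottleneck = min(path_mtus)
        if size <= bottleneck:
            return size                     # fits every link: delivered
        if icmp_filtered:
            return None                     # the need-frag ICMP never comes back
        size = bottleneck                   # ICMP told us the next-hop MTU

# A 1500-byte packet over a path with a 1414-byte link:
print(pmtud([1500, 1414, 1500], 1500))                      # 1414
print(pmtud([1500, 1414, 1500], 1500, icmp_filtered=True))  # None
```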

>A more detailed description of the client-side network might help:
>where is the low-MTU link, what hardware is on each end of it, where is
>the NAT being done, what speeds are the various pieces running at, that
>sort of thing.

As you can see above, getting that sort of data would be non-trivial.
It's tough enough getting the gentleman in Japan to send us the bits
of data he has.

This is very helpful.  Thank you.

-dgl-