Subject: Re: Networking question MTU on non-local nets
To: None <port-macppc@netbsd.org>
From: der Mouse <mouse@Rodents.Montreal.QC.CA>
List: port-macppc
Date: 06/15/2003 01:07:21
> This is wandering seriously off-topic, but still very
> interesting.....

Well, off-topic for port-macppc, maybe.  Maybe we should move it to
tech-net?  (I'm already there, so feel free to just change the
to-address in your reply if you agree.)

>>> 22:50:05.864796 192.168.0.38.49815 > mercy.icompute.com.http: S 1026952240:1026952240(0) win 32768 <mss 1460,nop,wscale 0,nop,nop,timestamp 707979176 0> (DF)
>>> 22:50:06.123198 mercy.icompute.com.http > 192.168.0.38.49815: S 2847775046:2847775046(0) ack 1026952241 win 16384 <mss 1414,nop,wscale 0,nop,nop,timestamp 17981721 707979176>
>>> 22:50:06.123306 192.168.0.38.49815 > mercy.icompute.com.http: . ack 1 win 33648 <nop,nop,timestamp 707979176 17981721> (DF)
>>> 22:50:06.127597 192.168.0.38.49815 > mercy.icompute.com.http: P 1:225(224) ack 1 win 33648 <nop,nop,timestamp 707979176 17981721> (DF)
>>> 22:50:06.397694 mercy.icompute.com.http > 192.168.0.38.49815: . 1:993(992) ack 225 win 17520 <nop,nop,timestamp 17981722 707979176> (frag 8006:1024@0+)
>>> 22:50:06.397704 mercy.icompute.com > 192.168.0.38: (frag 8006:422@1024)
>>> 22:50:06.444745 192.168.0.38.49815 > mercy.icompute.com.http: . ack 1415 win 33648 <nop,nop,timestamp 707979177 17981722> (DF)
>>> 22:50:06.705762 mercy.icompute.com.http > 192.168.0.38.49815: . 1415:2407(992) ack 225 win 17520 <nop,nop,timestamp 17981722 707979176> (frag 8007:1024@0+)
>>> 22:50:06.718277 mercy.icompute.com.http > 192.168.0.38.49815: . 2829:3821(992) ack 225 win 17520 <nop,nop,timestamp 17981722 707979176> (frag 8008:1024@0+)
>>> 22:50:07.761468 mercy.icompute.com.http > 192.168.0.38.49815: . 1415:2407(992) ack 225 win 17520 <nop,nop,timestamp 17981724 707979176> (frag 8009:1024@0+)
>>> 22:50:10.761691 mercy.icompute.com.http > 192.168.0.38.49815: . 1415:2407(992) ack 225 win 17520 <nop,nop,timestamp 17981730 707979176> (frag 8010:1024@0+)

[quoted in full to have the information at hand in the future]

> [traceroute from client side]
>>>>>traceroute to www.qdea.com (209.46.8.67), 30 hops max, 40 byte packets
>>>>>  1  192.168.0.1 (192.168.0.1)  0.795 ms  0.57 ms  0.527 ms
>>>>>  2  * * *
[27 more lines of "* * *" snipped]
>>>>> 30  * * *
> Useful, huh?

Immensely. :-þ

> The fact that the traceroute works (from my side) is a good sign if I
> am hoping to get PMTU-D working.

Indeed it is.  Below, I present further evidence that if it doesn't
work, the problem is on your side, where you can in principle fix it.

>> [...back-to-back....]
>>> 22:50:06.397694 [...] (frag 8006:1024@0+)
>>> 22:50:06.397704 [...] (frag 8006:422@1024)
>> ...interesting.  [...350µs at 10MBit...35µs at 100...only 10µs
>> apart...]
> I seriously doubt that the fragmentation is happening on this
> continent.  I'm betting that it's happening somewhere in Japan.

I'd have guessed that even without seeing the timings: whenever I've
seen such problems, the low-MTU link has been close to the client.
(Comparatively more clients than servers are behind PPPoE, VPNs, and
suchlike MTU-lowering things.)

But I agree with you; the fragmentation point is almost certainly in
Japan in this case.

If your traceroute supports -P (basically, this makes traceroute do its
own PMTU-D), you might try that.  On the theory that the low-MTU link
probably is close to the client, I did a traceroute -P from my own
machine toward the client's address, and...

[Sparkle] 864> traceroute -P pddf654.tkyoac00.ap.so-net.ne.jp
traceroute to pddf654.tkyoac00.ap.so-net.ne.jp (218.221.246.84), 30 hops max, 17914 byte packets
message too big, trying new MTU = 1500
 1  Stone (216.46.5.9)  7.167 ms  5.946 ms *
 2  core-04.openface.ca (216.46.14.121)  54.841 ms  52.423 ms  52.056 ms
 3  bob.openface.ca (216.46.1.1)  51.988 ms  51.767 ms  51.595 ms
 4  doug.openface.ca (216.46.1.16)  52.112 ms  52.349 ms  57.553 ms
 5  border-peer1.openface.ca (216.46.0.245)  154.647 ms  54.558 ms  57.028 ms
 6  openface-gw.peer1.net (65.39.144.129)  53.590 ms  54.071 ms  53.243 ms
 7  Gig4-0.mtl-gsr-a.peer1.net (216.187.90.229)  54.545 ms  68.725 ms  54.515 ms
 8  OC48POS0-0.nyc-gsr-b.peer1.net (216.187.123.234)  62.799 ms  77.577 ms  63.227 ms
 9  GIG1-0.wdc-gsr-a.peer1.net (216.187.123.226)  67.580 ms  68.829 ms  68.616 ms
10  ge-2-3-0.r02.asbnva01.us.bb.verio.net (206.223.115.112)  68.804 ms  67.953 ms  70.125 ms
11  p16-0-1-2.r21.asbnva01.us.bb.verio.net (129.250.2.62)  74.460 ms  69.896 ms  69.524 ms
12  p16-5-0-0.r01.mclnva02.us.bb.verio.net (129.250.2.180)  69.997 ms  70.349 ms  72.354 ms
13  p16-7-0-0.r02.mclnva02.us.bb.verio.net (129.250.5.10)  71.260 ms  71.407 ms  70.120 ms
14  p16-0-1-2.r20.plalca01.us.bb.verio.net (129.250.2.192)  127.981 ms  128.502 ms  129.232 ms
15  xe-0-2-0.r21.plalca01.us.bb.verio.net (129.250.4.231)  128.421 ms  127.746 ms  144.905 ms
16  p64-0-0-0.r21.snjsca01.us.bb.verio.net (129.250.5.49)  128.569 ms  137.866 ms  128.656 ms
17  p16-1-1-0.r82.mlpsca01.us.bb.verio.net (129.250.3.195)  128.506 ms  129.103 ms  128.873 ms
18  p16-0-2-0.r21.tokyjp01.jp.bb.verio.net (129.250.4.158)  243.949 ms  245.087 ms  244.665 ms
19  xe-1-1-0.r20.tokyjp01.jp.bb.verio.net (129.250.3.233)  243.093 ms  242.310 ms  242.633 ms
20  ge-3-0-0.a10.tokyjp01.jp.ra.verio.net (61.213.162.76)  230.151 ms  229.910 ms  230.652 ms
21  61.120.146.230 (61.120.146.230)  230.672 ms ge-3-0-0.a10.tokyjp01.jp.ra.verio.net (61.213.162.76)  243.620 ms  241.961 ms
22  61.120.146.230 (61.120.146.230)  242.440 ms  242.022 ms note-13Gi0-0-0.net.so-net.ne.jp (61.211.63.133)  230.177 ms
23  61.211.63.247 (61.211.63.247)  232.482 ms  232.314 ms  234.189 ms
24  61.211.63.247 (61.211.63.247)  232.197 ms  231.845 ms  233.382 ms
25  61.211.63.247 (61.211.63.247)  234.610 ms
fragmentation required and DF set, next hop MTU = 1454
25  pddf654.tkyoac00.ap.so-net.ne.jp (218.221.246.84)  269.161 ms  269.644 ms  269.784 ms
[Sparkle] 865> 

If this is to be believed, the low-MTU link is the very last hop.  I
do wonder what's going on at hops 21/22 and 23/24, though; given the
way different gateways answer within hops 21 and 22, there appears to
be some kind of variant routing going on - loadsharing, maybe.
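
In case it's useful, the guts of what traceroute -P does can be
sketched in a few lines.  Here's a toy rendition in Python with scapy
(my own invention, not traceroute's actual code; the probe port and
sizes are arbitrary, and it needs root and a reasonably recent scapy):

    # pmtu_probe.py - sketch of traceroute -P style PMTU probing
    from scapy.all import IP, UDP, ICMP, Raw, sr1

    def probe_pmtu(dst, size=1500):
        # Send DF'd probes; shrink whenever a router answers with ICMP
        # "fragmentation needed" (type 3, code 4) and a next-hop MTU.
        while size >= 68:                      # minimum legal IPv4 MTU
            pad = b"\0" * (size - 28)          # 20 IP + 8 UDP header octets
            resp = sr1(IP(dst=dst, flags="DF") / UDP(dport=33434) / Raw(pad),
                       timeout=3, verbose=0)
            if (resp and resp.haslayer(ICMP)
                    and resp[ICMP].type == 3 and resp[ICMP].code == 4):
                mtu = resp[ICMP].nexthopmtu    # 0 from ancient routers
                size = mtu if 0 < mtu < size else size - 8
                continue
            return size    # probe got through (or was dropped silently)
        return None

    print(probe_pmtu("pddf654.tkyoac00.ap.so-net.ne.jp"))

The real traceroute -P interleaves this with the usual TTL stepping,
which is how it can pin "next hop MTU = 1454" on a particular hop.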

>> A more detailed description of the client-side network might help:
>> where is the low-MTU link, what hardware is on each end of it, where
>> is the NAT being done, what speeds are the various pieces running
>> at, that sort of thing.
> As you can see above, getting that sort of data would be non-trivial.
> It's tough enough getting the gentleman in Japan to send us the bits
> of data he has.

:-(  I misunderstood; I thought you actually had control over both ends
of the test connection.

Because my traceroute -P worked, I feel confident that the ICMP
unreachables necessary to drive PMTU-D are making it out from the
Japanese end of things.  But it does look to me as though something is
broken on the client side.  In your trace, all of the first frags
arrive but only some of the second frags do, and that practically
guarantees that something is wrong between the fragmentation point and
the endpoint; since, per my traceroute above, the fragmentation point
is right next to the endpoint, the trouble must be on that end.
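
If you can emit raw packets from the server end (or anywhere upstream
of that last hop), this is even directly testable: hand-fragment a
datagram the same way and watch, with tcpdump at the client, which
pieces survive.  A hypothetical scapy sketch ("client.example" is a
placeholder for the client's public address):

    from scapy.all import IP, ICMP, Raw, fragment, send

    # 20 IP + 8 ICMP + 1438 data = a 1466-octet datagram, the size in
    # the trace; fragsize=1024 reproduces the 1024@0+ / 422@1024 split
    big = IP(dst="client.example") / ICMP() / Raw(b"\0" * 1438)
    for frag in fragment(big, fragsize=1024):
        send(frag, verbose=0)

Repeating that at a handful of sizes would show whether second frags
are being dropped systematically or only sporadically.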

I notice something else weird:

22:50:06.397694 mercy.icompute.com.http > 192.168.0.38.49815: . 1:993(992) ack 225 win 17520 <nop,nop,timestamp 17981722 707979176> (frag 8006:1024@0+)
22:50:06.397704 mercy.icompute.com > 192.168.0.38: (frag 8006:422@1024)

The packet was fragmented into 1024-octet and 422-octet pieces;
however, according to traceroute -P, the MTU is 1454.  Normally,
fragmentation puts as much as possible in the first frag, which in this
case would lead to a second fragment with a mere 14 octets of data:
the datagram is 1466 octets in all, and an MTU of 1454 leaves room,
after the 20-octet IP header, for 1432 octets in the first frag
(fragment data has to fall on an 8-octet boundary).  I could also see a
fragmentation module that tried to produce roughly equal fragment
sizes, but that wasn't done either.  I don't know the provenance of the
stack that did the fragmentation, but it's behaving rather unusually.
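
The arithmetic, for anyone who cares to check it (plain Python, with
the numbers taken from the trace and the traceroute -P output):

    mtu, ip_hlen = 1454, 20
    payload = 1024 + 422               # IP payload octets in the trace
    first = (mtu - ip_hlen) // 8 * 8   # fragment offsets are 8-aligned
    print(first, payload - first)      # -> 1432 14, not 1024 and 422

Whatever produced 1024/422 was evidently not fragmenting to a 1454
MTU - 1024 octets of data plus a 20-octet header looks more like
fragmentation to some MTU between 1044 and 1051.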

If it would help you, I can set up a machine deliberately placed
behind a low-MTU link that we can run experiments with.  (If you want
to take me up on that, off-list is probably best.)

/~\ The ASCII				der Mouse
\ / Ribbon Campaign
 X  Against HTML	       mouse@rodents.montreal.qc.ca
/ \ Email!	     7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B