Subject: Re: Networking question MTU on non-local nets
To: None <port-macppc@netbsd.org>
From: Donald Lee <MacPPC@caution.icompute.com>
List: port-macppc
Date: 06/15/2003 09:33:33
Recently, der mouse cogently squeaked:
>[Sparkle] 864> traceroute -P pddf654.tkyoac00.ap.so-net.ne.jp
>traceroute to pddf654.tkyoac00.ap.so-net.ne.jp (218.221.246.84), 30 hops max, 17914 byte packets
>message too big, trying new MTU = 1500
> 1  Stone (216.46.5.9)  7.167 ms  5.946 ms *
> 2  core-04.openface.ca (216.46.14.121)  54.841 ms  52.423 ms  52.056 ms
> 3  bob.openface.ca (216.46.1.1)  51.988 ms  51.767 ms  51.595 ms
> 4  doug.openface.ca (216.46.1.16)  52.112 ms  52.349 ms  57.553 ms
> 5  border-peer1.openface.ca (216.46.0.245)  154.647 ms  54.558 ms  57.028 ms
> 6  openface-gw.peer1.net (65.39.144.129)  53.590 ms  54.071 ms  53.243 ms
> 7  Gig4-0.mtl-gsr-a.peer1.net (216.187.90.229)  54.545 ms  68.725 ms  54.515 ms
> 8  OC48POS0-0.nyc-gsr-b.peer1.net (216.187.123.234)  62.799 ms  77.577 ms  63.227 ms
> 9  GIG1-0.wdc-gsr-a.peer1.net (216.187.123.226)  67.580 ms  68.829 ms  68.616 ms
>10  ge-2-3-0.r02.asbnva01.us.bb.verio.net (206.223.115.112)  68.804 ms  67.953 ms  70.125 ms
>11  p16-0-1-2.r21.asbnva01.us.bb.verio.net (129.250.2.62)  74.460 ms  69.896 ms  69.524 ms
>12  p16-5-0-0.r01.mclnva02.us.bb.verio.net (129.250.2.180)  69.997 ms  70.349 ms  72.354 ms
>13  p16-7-0-0.r02.mclnva02.us.bb.verio.net (129.250.5.10)  71.260 ms  71.407 ms  70.120 ms
>14  p16-0-1-2.r20.plalca01.us.bb.verio.net (129.250.2.192)  127.981 ms  128.502 ms  129.232 ms
>15  xe-0-2-0.r21.plalca01.us.bb.verio.net (129.250.4.231)  128.421 ms  127.746 ms  144.905 ms
>16  p64-0-0-0.r21.snjsca01.us.bb.verio.net (129.250.5.49)  128.569 ms  137.866 ms  128.656 ms
>17  p16-1-1-0.r82.mlpsca01.us.bb.verio.net (129.250.3.195)  128.506 ms  129.103 ms  128.873 ms
>18  p16-0-2-0.r21.tokyjp01.jp.bb.verio.net (129.250.4.158)  243.949 ms  245.087 ms  244.665 ms
>19  xe-1-1-0.r20.tokyjp01.jp.bb.verio.net (129.250.3.233)  243.093 ms  242.310 ms  242.633 ms
>20  ge-3-0-0.a10.tokyjp01.jp.ra.verio.net (61.213.162.76)  230.151 ms  229.910 ms  230.652 ms
>21  61.120.146.230 (61.120.146.230)  230.672 ms ge-3-0-0.a10.tokyjp01.jp.ra.verio.net (61.213.162.76)  243.620 ms  241.961 ms
>22  61.120.146.230 (61.120.146.230)  242.440 ms  242.022 ms note-13Gi0-0-0.net.so-net.ne.jp (61.211.63.133)  230.177 ms
>23  61.211.63.247 (61.211.63.247)  232.482 ms  232.314 ms  234.189 ms
>24  61.211.63.247 (61.211.63.247)  232.197 ms  231.845 ms  233.382 ms
>25  61.211.63.247 (61.211.63.247)  234.610 ms
>fragmentation required and DF set, next hop MTU = 1454
>25  pddf654.tkyoac00.ap.so-net.ne.jp (218.221.246.84)  269.161 ms  269.644 ms  269.784 ms
>[Sparkle] 865> 
>
>If this is to be believed, the low-MTU link is the very last hop.  I
>really wonder what's with hops 21/22 and 23/24; the way different
>gateways respond on lines 21 and 22, it appears there is some kind of
>variant routing going on - loadsharing, maybe.
>
>>> A more detailed description of the client-side network might help:
>>> where is the low-MTU link, what hardware is on each end of it, where
>>> is the NAT being done, what speeds are the various pieces running
>>> at, that sort of thing.
>> As you can see above, getting that sort of data would be non-trivial.
>> It's tough enough getting the gentleman in Japan to send us the bits
>> of data he has.
>
>:-(  I misunderstood; I thought you actually had control over both ends
>of the test connection.
>
>Because my traceroute -P worked, I feel confident that the ICMP
>unreachables necessary to drive PMTU-D are making it out from the
>Japanese end of things.  But it does look to me as though something is
>broken on the client side; some but not all of the second frags making
>it through - but all the first frags working - practically guarantees
>that there is something wrong between the fragmentation point and the
>endpoint; since the fragmentation point is right next to the endpoint
>per my traceroute above, this means it's on that end.
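
For anyone following along, the PMTU-D feedback loop der Mouse is relying
on can be sketched as a toy simulation.  This is not real sockets, and the
hop MTUs below are made-up values chosen to resemble the traceroute above:

```python
# Toy simulation of Path MTU Discovery (RFC 1191), not real networking.
# The sender transmits with DF set; any hop whose MTU is too small drops
# the packet and returns ICMP "fragmentation required and DF set,
# next hop MTU = N"; the sender then lowers its packet size to N and
# retries until the packet fits every hop.

def discover_pmtu(hop_mtus, initial_size):
    """Return the path MTU the sender converges on."""
    size = initial_size
    while True:
        # First hop that can't forward a DF-set packet this big, if any.
        blocker = next((mtu for mtu in hop_mtus if mtu < size), None)
        if blocker is None:
            return size          # packet fit every hop: converged
        size = blocker           # the ICMP message told us the next-hop MTU

# Hypothetical path: Ethernet-MTU hops plus one 1454-byte link at the
# far end, as the traceroute above suggests.
path = [1500, 1500, 1500, 1454]
print(discover_pmtu(path, 17914))   # 1454
```

The point is that the whole mechanism depends on those ICMP messages
getting back to the sender; if a firewall eats them, the sender keeps
retransmitting too-big packets and the connection hangs.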

I have a feeling that the endpoint is some sort of dynamic IP, too.

When I ran some ping tests against this endpoint (ping -s xxx japan-guy),
I noticed that packets over 1400 bytes were getting through.  That
suggests the fragmentation at ~1K was, or is, transient.
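
Probing by packet size like this amounts to a binary search.  A rough
sketch of the logic, with a stand-in predicate instead of real pings (a
real version would send DF-set probes of each size and watch for replies):

```python
# Sketch of path-MTU probing by packet size, as with the "ping -s xxx"
# tests above.  "fits" stands in for "did a DF-set probe of this size
# get a reply?"; here it is just a function, not real traffic.

def probe_pmtu(fits, lo=68, hi=65535):
    """Binary-search the largest packet size that gets through."""
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if fits(mid):
            lo = mid             # reply came back: path MTU >= mid
        else:
            hi = mid - 1         # no reply / frag needed: MTU is smaller
    return lo

# Pretend the path MTU is 1454, as the traceroute suggested.
print(probe_pmtu(lambda size: size <= 1454))   # 1454
```

With a transient low-MTU link, of course, repeated probes can give
different answers at different times, which matches what I saw.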

>[snip]
>If it would help you, I can set up a machine deliberately behind a
>low-MTU link we can run experiments with.  (If you want to take me up
>on that, off-list is probably best.)

Thank you.  That's very generous.

I've been watching my server now that I've enabled PMTU-D, and I've
noticed two things.  One, the machine is still up and performing
reasonably (good).  Two, when I look at the traffic and watch
specifically for ICMP "must frag" messages, they happen, but they are
pretty rare; I see a small number of them per hour on this server.
(The server gets roughly 1.5 million HTTP requests per month.)
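
For the record, those "must frag" messages are ICMP type 3 (destination
unreachable), code 4 (fragmentation needed and DF set), and RFC 1191 puts
the next-hop MTU in bytes 6-7 of the ICMP header.  A quick parser sketch;
the sample bytes are fabricated, not captured traffic:

```python
import struct

# ICMP "fragmentation needed and DF set" is type 3, code 4.  The header
# is: type (1 byte), code (1), checksum (2), unused (2), next-hop MTU (2).

def parse_frag_needed(icmp):
    """Return the next-hop MTU if this is a frag-needed message, else None."""
    if len(icmp) < 8:
        return None
    icmp_type, code, _checksum, _unused, mtu = struct.unpack("!BBHHH", icmp[:8])
    if icmp_type == 3 and code == 4:
        return mtu
    return None

# Fabricated example: type 3, code 4, zeroed checksum, next-hop MTU 1454.
sample = struct.pack("!BBHHH", 3, 4, 0, 0, 1454)
print(parse_frag_needed(sample))   # 1454
```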

I am satisfied at this point that PMTU-D is safe and effective, so
unless I encounter some major badness, I plan to leave things as they
are.

Many thanks to you and Manuel for your insights.

-dgl-