Subject: Re: TCP_NODELAY and full links (was Re: sup problems?)
To: None <feico@pasta.cs.uit.no>
From: Sean Doran <smd@ebone.net>
List: current-users
Date: 09/29/1999 04:01:13
Feico Dillema writes:
| Am I right to conclude that it is safe or at least reasonable to have
| PMTU disc. switched on for a server not directly behind such a
| blackhole? How about clients behind such a blackhole? Will they be
| blocked?
It's reasonable, yes, but there are risks.
Below I use client and server although really it's TCP sender and
TCP receiver in each case; TCP does not care about clients and servers
per se.
Case #1: no filtering out of ICMP
-- server sends big packet with the "Don't Fragment" (DF) bit set
towards client
-- router somewhere tries to forward big packet onto small-packet-only
interface, and cannot
-- router sends "Needs Fragmentation" ICMP message back to server
hopefully with the MTU of the small-packet-only interface embedded
in it
-- server tries a smaller packet (small-packet-only interface
MTU sized if known from "NF" message content) and hopefully
that works (otherwise you see "Needs Fragmentation" again)
-- at intervals, probe the path MTU by repeating the four steps above
Result: we reasonably rapidly converge on the MTU of the path
between server and client, and for bulk transfers end up using
the biggest possible packets, thus improving network efficiency.
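
A minimal sketch of Case #1 in Python, under simplifying assumptions:
the path is modelled as a list of per-hop MTUs, every "NF" message
carries the MTU of the interface that refused the packet, and nothing
is lost. The names discover_path_mtu and link_mtus are made up for
illustration and do not come from any real stack.

```python
def discover_path_mtu(link_mtus, first_hop_mtu):
    """Return (path_mtu, attempts) for a sender that honours ICMP NF.

    link_mtus: MTU of each hop from sender to receiver (hypothetical model).
    first_hop_mtu: size of the first packet tried, sent with DF set.
    """
    size = first_hop_mtu
    attempts = 0
    while True:
        attempts += 1
        # The first hop too small for the packet drops it and sends NF.
        blocking = next((mtu for mtu in link_mtus if mtu < size), None)
        if blocking is None:
            return size, attempts  # packet fit every hop with DF set
        # Assumption: the NF message tells us the refusing hop's MTU.
        size = blocking


if __name__ == "__main__":
    print(discover_path_mtu([1500, 1006, 1500, 576], 1500))
```

With two successively smaller bottlenecks on the path, the sender needs
one attempt per bottleneck plus a final attempt that gets through.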
Case #2: ICMP Needs Fragmentation messages are filtered out in the path
between the router with the small-packet-only interface and the
server itself
-- big TCP segment is sent with the "Don't Fragment" bit set in the IP header
-- no "Needs Fragmentation" ICMP message is seen, as it has
been eaten by a filter or otherwise lost in transit
-- TCP RTO (retransmission timeout) or 3x duplicate ACK occurs,
with the client ACKing the segment before the "big segment"
-- segment is resent as-is, thereby triggering "NF" which again never
arrives. Loop to previous step until the conversation times out.
Result: conversation killed by ICMP black hole.
Case #3: as per case #2 but server can detect and avoid PMTU blackholes
-- TCP RTO (retransmission timeout) or 3DUPACK occurs, smaller
segment is resent
-- when ACK is seen, use the smaller segment size as basis for new
path MTU, otherwise loop to previous step
-- probe path MTU at intervals
Result: we stall each time we send a packet that is bigger
than the path MTU. We do converge on the maximum supported
packet size and thereby improve network efficiency, but
we send at a slower rate due to packet losses.
If we're lucky we see DUPACKs and do a fast-retransmit/fast-recovery,
and the stall is not so bad, and possibly even beneficial (e.g., in
Greg Wood's congested FIFO case, the halving of cwnd reduces
unnecessary buffer occupation).
If we are unlucky, we see RTOs, and have to slow-start.
(This drains Greg's queue to zero, which might be good for other
things trying to use his link, but not so good for the speed
at which the affected TCP bulk transfer finishes).
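
Cases #2 and #3 can be contrasted with a small Python sketch. It
assumes every "NF" message is filtered, that a too-big DF packet simply
vanishes, and that the blackhole-aware sender steps down through a
fixed table of sizes. The table (PLATEAUS), the retry limit and the
function names are illustrative assumptions, not taken from any real
implementation.

```python
# Hypothetical table of candidate MTUs, largest first.
PLATEAUS = [1500, 1006, 576, 296, 68]


def deliver(size, path_mtu):
    """A DF packet arrives only if it fits; the NF reply is always lost."""
    return size <= path_mtu


def transfer(path_mtu, start_size, detect_blackhole, max_retries=5):
    """Return (size that was ACKed or None, number of timeouts suffered)."""
    size = start_size
    timeouts = 0
    while timeouts <= max_retries:
        if deliver(size, path_mtu):
            return size, timeouts
        timeouts += 1  # RTO or 3DUPACK: the big segment never arrived
        if detect_blackhole:
            # Case #3: resend a smaller segment instead of the same one.
            smaller = [p for p in PLATEAUS if p < size]
            if not smaller:
                break
            size = smaller[0]
        # Case #2: size is unchanged, so the next attempt fails the same way.
    return None, timeouts


if __name__ == "__main__":
    print(transfer(576, 1500, detect_blackhole=False))
    print(transfer(576, 1500, detect_blackhole=True))
```

Note that in this model the Case #3 sender settles on the largest table
entry that fits, which may be smaller than the true path MTU, and pays
one timeout for every size it had to abandon.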
Case #4: subtle optimization of #3
-- TCP RTO (retransmission timeout) occurs, resend packet as-is but
without the "Don't Fragment" bit set
-- if ACK is seen, then next RTT try a big packet with DF again,
and if we see an RTO, we have probably stumbled into an ICMP
black hole, so use the smaller MTU
Result: more clues about whether we are seeing packets
being lost because of path MTU blackhole instead of congestion.
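
The Case #4 decision can be sketched in Python as follows. The model is
an assumption: a packet without DF always gets through unless
congestion drops it, and congestion is represented as a set of send
indices that are lost. The "inconclusive" outcome, for when even the
non-DF resend is lost, is not covered in the description above and is
added only so the function returns something in every case.

```python
def delivered(size, path_mtu, df, dropped):
    """True if the packet is ACKed in this simplified model."""
    if dropped:
        return False  # lost to congestion regardless of size
    # Without DF a router may fragment, so size does not matter.
    return (not df) or size <= path_mtu


def diagnose(size, path_mtu, congestion_drops=frozenset()):
    """Classify a loss using the three sends of Case #4.

    congestion_drops: indices (0, 1, 2) of sends lost to congestion;
    a hypothetical knob for this sketch only.
    """
    # Send 0: the original big packet with DF set.
    if delivered(size, path_mtu, True, 0 in congestion_drops):
        return "ok"
    # Send 1: after the RTO, resend as-is but without DF.
    if not delivered(size, path_mtu, False, 1 in congestion_drops):
        return "inconclusive"  # assumption: no verdict from this round
    # Send 2: next RTT, try a big packet with DF again.
    if delivered(size, path_mtu, True, 2 in congestion_drops):
        return "probable congestion"
    return "probable black hole"


if __name__ == "__main__":
    print(diagnose(1500, 576))
    print(diagnose(1500, 1500, {0}))
```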
We want to reprobe the path MTU because it can change over time,
and we would like to exploit any *increase* in path MTU, so
every so often we send a larger packet if we aren't already at the
MTU of the medium to which the server is directly connected.
Every so often can be as aggressive as once per RTT, and being
aggressive is great if we haven't seen a "Needs Fragmentation"
message or an RTO yet. Once we've seen an "NF" or we take a guess
that the RTOs are as a result of blackholing rather than congestion
(i.e., if we're losing the big packet and not the small ones,
we are probably looking at an ICMP blackhole), being aggressive is more likely wasteful
than useful.
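
One way to express that reprobing policy, as a Python sketch. The
600-second slow interval is an assumed placeholder rather than a value
from this discussion, and should_probe and its parameters are
hypothetical names.

```python
def probe_interval(rtt, slowdown_seen, slow_interval=600.0):
    """Seconds to wait before trying a larger packet again.

    slowdown_seen: True once an "NF" has arrived or the RTOs look like
    blackholing. slow_interval is an assumed default for illustration.
    """
    return slow_interval if slowdown_seen else rtt


def should_probe(now, last_probe, rtt, current_mtu, local_mtu, slowdown_seen):
    """Decide whether to send a larger packet now."""
    if current_mtu >= local_mtu:
        # Already at the MTU of the directly attached medium.
        return False
    return now - last_probe >= probe_interval(rtt, slowdown_seen)


if __name__ == "__main__":
    # Aggressive: nothing bad seen yet, so probe once per RTT.
    print(should_probe(1.0, 0.8, 0.1, 576, 1500, slowdown_seen=False))
    # Conservative: an NF or suspected blackhole was seen recently.
    print(should_probe(1.0, 0.8, 0.1, 576, 1500, slowdown_seen=True))
```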
There are longwinded arguments here and there about how fast one
should react to decreases in path MTU. One school is very definitely
in favour of setting DF on *every* packet, so that any path MTU change
can be dealt with immediately. Another school favours probing only
once per RTT. It all depends on how deep your feelings are about the
risks of IP fragmentation. (Mine aren't very deep; I often turn
off path MTU discovery and send segments (without DF set!) that
are the minimum of the attached-network MTU and the receiver's
advertised MSS and, if it's a busy box being talked to by a lot of
random people, the standard ethernet MTU).
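
That no-PMTU-discovery choice of segment size could be sketched in
Python like this. The 40-byte allowance for IP and TCP headers without
options is an assumption added to put MTU and MSS on the same footing,
and segment_size is a hypothetical helper name.

```python
ETHERNET_MTU = 1500   # standard ethernet MTU in bytes
IP_TCP_HEADERS = 40   # assumption: 20-byte IP + 20-byte TCP, no options


def segment_size(attached_mtu, peer_mss, busy_public_server=False):
    """Pick a TCP segment size when sending without DF set."""
    limit_mtu = attached_mtu
    if busy_public_server:
        # Talking to lots of random peers: never exceed ethernet-sized packets.
        limit_mtu = min(limit_mtu, ETHERNET_MTU)
    return min(limit_mtu - IP_TCP_HEADERS, peer_mss)


if __name__ == "__main__":
    # FDDI-attached box (MTU 4352) with a peer advertising MSS 4312.
    print(segment_size(4352, 4312))
    print(segment_size(4352, 4312, busy_public_server=True))
```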
Sean.