Hi, How the three are linked is an interesting story... I'll make it short. There are several bugs involved. If you wish to reproduce the issue, here's how I did it. The bug annoys me at work, so I built a sketch of that network using Xen DomUs. Here's the setup: +---+ Eth +---+ Eth +----------+ Eth +---+ Eth +---+ | C +=====+ A +=====+ Internet +=====+ B +=====+ D + +---+ +-+-+ +----------+ +-+-+ +---+ | gif tunnel with | +--------------------------+ IPsec transport The so-called "Internet" is actually the Dom0. The four Ethernet segments are done with bridges on the Dom0. The A-Dom0 network is 10.0.0/24, the B-Dom0 network is 10.0.1/24, the A-C network is 10.0.2/24 and the B-D network is 10.0.3/24. The gif tunnel connects addresses starting with 10.1.0. On each of its interfaces, the last component of the IP address is the hex number of the host. How IPsec is configured is completely irrelevant at this point: as a matter of fact you don't have to configure any policy, merely compiling a kernel with support for IPsec is enough. I expect the "Bad packet length" issue to be common for people using a similar setup, or an environment prone to MTU sensitiveness. You experiment it seeing scp, or even sometimes ssh, aborting with a message about a SSH packet length that indeed seems completely bogus. I've only been able to pinpoint the issue with ssh/scp. Any other application I could try (be it ftp, or a simple test I coded) are simply faced with the issue Edgar Fuß describes in a thread these days: the connection stalls. Note that it can happen with ssh/scp too, it's just that sometimes ssh/scp will go further for some timing reason. So many bugs to pick from... I have a trace of a failing scp session at the following address: http://taryn.cubidou.net/~cube/onA As its name suggests, it's a capture done on machine A. The scp was trying to send a file from C to D. The IPsec bug ============= First thing's first. We'll see how fixing that bug hides the SACK one later. Look at the "Need Fragment" ICMP messages in the trace. Can you see what is wrong? Here I thank Michael Van Elst for analysing this bug. In ip_output.c:ip_forward(), we get EMSGSIZE returned because of the MTU of the gif interface (1280 in a default setup). There is a mention here, "XXX quickhack!!!" for good reason: the code doesn't make sure the packet was actually going to be handled by an IPsec policy. In this case, it's not (at this point it hasn't been forwarded to the gif interface). The problem is ipsec4_getpolicybyaddr() will always return something, the default policy if no existing policy matched. The "req" field of that policy is NULL, hence "destmtu" doesn't get set and stays at 0, which we send back in the ICMP message. This is a bug that needs to be fixed. Michael suggested the following workaround for now: --- ip_input.c 16 Sep 2007 15:34:59 -0000 1.236.2.1 +++ ip_input.c 20 Feb 2008 15:11:20 -0000 @@ -1996,7 +1996,7 @@ IPSEC_DIR_OUTBOUND, IP_FORWARDING, &ipsecerror); - if (sp == NULL) + if (sp == NULL || sp->req == NULL) destmtu = ipforward_rt.ro_rt->rt_ifp->if_mtu; else { /* count IPsec header size */ The SACK bug (and other TCP issues) =================================== That one took a lot longer to track down, because I had to learn how TCP works in NetBSD, which, well, took time. I find it interesting that TCP doesn't act immediately upon receiving an ICMP "Need Fragment" message. What the code does is record it and wait to use it in the handler of the retransmission timeout handler. My guess here is that not all packets that have been sent might get blocked, and if any of them went through, the peer will send a duplicate ACK, possibly with SACK information and fast retransmit will roll forward. As you can see in the trace though, it's not the case here. There was enough data in the socket buffer to send 4 segments in a burst (the maximum allowed), and all of them got an ICMP reply. Therefore we wait for a whole 1.5 second for the timeout to happen. I really think we should get the route created and adjust the segment size immediately. Maybe there are other reasons for not doing that. Anyway, what does the code do with that "MTU 0" thing? It goes all the way down to ip_icmp.c:icp_mtudisc(), which then uses a hard coded table of values to try (which I find rather smart). The first possible one is 1492, so that's what the TCP code will use. At this point SACK is not yet involved of course, having received no ACK. Plain retransmit is the way to go, using 1440 as the segment size instead of 1448 before. I'm not entirely sure why there are two retransmission here. But the second one does send 4 packets, and the congestion window has shrunk enough so that the last packet is small enough to go through the tunnel. We get a duplicate ACK in return with SACK information, so we initiate SACK recovery. While ICMP Need Fragment messages have been received, they have not been acted upon, so TCP still thinks it's ok to send 1440 bytes of payload. Tough luck. This is where it gets specific to ssh/scp that doesn't always run into the issue my test app and Edgar Fuß (I believe) are facing: the rexmt timer was armed a second time, so we get to do the Path MTU Discovery dance a second time. But SACK has recorded in its scoreboard that it has retransmitted some data. So tcp_output() will start where the previous packet left off. The congestion window make it so it's only possible to send a 468-byte long packet, so we do that. Now look at the next packet, frame #100 on the wire. Compare the data with frame #74. Can you see the bug? Now compare the date in #100 and in #102. What I know is that what's in #102 is correct, so NetBSD sent data from the buffer with the wrong sequence number. Looking back, I should have asked the following question at this point: why it is even sending this packet (and the subsequent ones)? We've filled the congestion window already, since the previous packet was not a full segment. A lot of debug allowed my to understand the issue: tcp_mtudisc invalidates all previous send and sets snd_nxt to the value of snd_una so that tcp_output will start from the beginning. Except SACK entered the picture, and it will try to send what's after what it already sent. After sending #99, as sendalot is 1, we goto again. "off" gets to point after the data sent in #99, but cwin is 0 because the congestion window is full. So sack_rxmit stays at 0. Things should stop there, except they don't, because when tcp_mtudisc reset snd_nxt, it didn't change sack_newdata, which means now snd_nxt is smaller than sack_newdata, so the second "cwin" gets set to a very large value, allowing the sending of full segemnts. But at that point, "off" is set in stone and doesn't agree with snd_nxt. Oops. My take at a patch: Index: tcp_subr.c =================================================================== RCS file: /cvsroot/src/sys/netinet/tcp_subr.c,v retrieving revision 1.208.2.1 diff -u -p -r1.208.2.1 tcp_subr.c --- tcp_subr.c 24 May 2007 19:13:13 -0000 1.208.2.1 +++ tcp_subr.c 20 Feb 2008 15:47:55 -0000 @@ -1729,7 +1729,7 @@ tcp_mtudisc(struct inpcb *inp, int errno /* * Resend unacknowledged packets. */ - tp->snd_nxt = tp->snd_una; + tp->snd_nxt = tp->sack_newdata = tp->snd_una; tcp_output(tp); } } tcp6_mtudisc() would need the same treatment. I looked at the other places where snd_nxt is reset in a similar way, but they seem to be safe: SACK either hasn't happened yet, or is completely out of the picture. A second look wouldn't be too much though. While those two patches makes the whole thing happy, I think we should re-visit the path MTU discovery code to be more efficient. In the case of blackholes for instance, we should make use of icmp_mtudisc's clever table. And I still think we should act on ICMP Need Fragment messages immediately. I'd appreciate if someone with TCP and possibly SACK knowledge would confirm my analysis and the correctness of the patch I suggest. Otherwise I'll commit sometime later... The reason I ask is that when we change the MTU, we should probably invalidate all SACK retransmits, like we do for non-SACK retransmits. The patch I suggest doesn't do that, it only give a chance to the code not to send bogus data. -- Quentin Garnier - cube%cubidou.net@localhost - cube%NetBSD.org@localhost "See the look on my face from staying too long in one place [...] every time the morning breaks I know I'm closer to falling" KT Tunstall, Saving My Face, Drastic Fantastic, 2007.
Attachment:
pgpXmG92z7EyV.pgp
Description: PGP signature