tech-net: Re: TCP hang on netbsd-2

Subject: Re: TCP hang on netbsd-2
To: Manuel Bouyer <bouyer@antioche.eu.org>
From: Bill Studenmund <wrstuden@netbsd.org>
List: tech-net
Date: 06/01/2005 23:27:21
--aVD9QWMuhilNxW9f
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Sun, May 29, 2005 at 07:55:24PM +0200, Manuel Bouyer wrote:
> Hi,
> I'm getting hung ssh connections from a 2.0_RC2 to a 2.0_STABLE box.
> This ssh has lots of traffic on stdout.

[snip]

> So it seems that data from client -> server is still working, but the send
> queue on the server nevers drain. Does someone remember seeing/fixing thi=
s ?
> I think I've also been hit by this between the same 2.0_STABLE server and=
 a
> 1.6.2 client, but I didn't collect infos at this time. I also think I'm
> seeing this only since I updated this server from 2.0_BETA to 2.0_STABLE:
> NetBSD pop.lip6.fr 2.0_STABLE NetBSD 2.0_STABLE (GENERIC.MP) #0: Sun Apr =
10 15:46:42 CEST 2005  root@pop.lip6.fr:/local/pop1/bouyer/tmp/i386/obj/loc=
al/pop1/bouyer/netbsd-2-0-clean/src/sys/arch/i386/compile/GENERIC.MP i386

I too am seeing what I think is a similar issue. I'm seeing it with a
3.99.1 kernel, but I've been seeing it for quite a while and it could have
come up in the time frame you mention. An app that sends a lot of data
will occasionally just stall.

My specific issue is that I have an iSCSI test tool that is talking to an
iSCSI target. One of its tests is to send a stream of NOP-OUT packets and
time how long it takes. An iSCSI NOP-OUT is just an in-band NOP packet
sent from the initiator to the target. A NOP-OUT can either just be sent
(no response required) or it can request the target reply with a NOP-IN
containing the data of the original NOP-OUT data payload. Both a NOP-OUT
and NOP-IN require 48 bytes of packet header.

The test sends 1000 NOP-OUTs with 0, 4, 16, 64, 256, 1024, and 4096 bytes=
=20
of payload. It does this first without and then with reply request. Thus=20
in the without case, there is a unidirectional data stream, and in the=20
"with" case there is a bi-directional stream.

Oh, before anyone asks, both sides set TCP_NODELAY to 1.

I run the tests over the loopback interface for simplistic testing. For
the first few tests (0, 4, 16, 64 byte payloads) without echo, everything
proceeds quickly. It takes on average 10 ns to send each NOP-OUT. Looking=
=20
at a tcp dump, the target side of the TCP connection sends an ack every=20
other data-in PDU, and the window stays wide open.

Oh, just to jump ahead a bit, all of the echoed tests, where the NOP-OUT=20
data are sent back, run fine. They go from 60 ns to 93 ns for 4k of data.=
=20
This is all on a slow test machine.

The problem comes when we get to the 256, 1024, or 4096 byte payloads
without echo. The writing procedure will just stop, for seconds at a time.
As noted above, both processes are running on the same box, and it's idle
at this time. As best I can tell, there is no real reason that the data
can't flow, other than that the receiver hasn't acknowledged received and
processed data & asked for more. So we end up with writes averaging on the
order of 5 ms or 18 ms! If it takes less than 100 ns to send data
round-trip, something's wrong for one-way to take 18,000 ns...

=46rom looking at a tcp dump, things are going fine, and then the receiving
side starts closing the window. It still acks every other packet, but the
window goes down. Finally it gets to less than a protocol write (256 +
48), and the sender stops sending. Nothing happens for five seconds, then
the sender tentatively sends the open-window's worth of the packet. The
receiver then acks it and opens the window up 100% and a flury of data
comes across (it's localhost, so it's a pair of 33220 byte packets), gets
acked, and all proceeds.

I've looked at what's happening with the programs by simultaneously=20
ktracing both the target and the tester and also tcpdumping. While that=20
tcpdump dropped a few packets, I was able to see that when the test tool=20
stopped sending, the target proceeded to read all of the awaiting data.=20
Then it entered its idle loop processing in this many-second window of=20
delay.

While I can see we don't necessarily want to send a window update for=20
every write, shouldn't we notice that the last window we advertized was=20
only 1% open and we're now 100% open??

Thoughts? I can probably share the relevant tcp dump if that'd help.

Take care,

Bill

--aVD9QWMuhilNxW9f
Content-Type: application/pgp-signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.3 (NetBSD)

iD8DBQFCnqbJWz+3JHUci9cRAgfoAJ4hyPfb7ZnFkj6COObnOYnUVfh6FQCcC6hl
z9IK3Qj0PA1mNyfz0Gp3jeo=
=Kt2r
-----END PGP SIGNATURE-----

--aVD9QWMuhilNxW9f--