Subject: Re: kern/17616: accept on a tcp socket loses options
To: None <gnats-admin@netbsd.org>
From: Bill Studenmund <wrstuden@netbsd.org>
List: tech-net
Date: 07/17/2002 13:11:03
I've figured out the odd TCP behavior I reported back in May.

On Thu, 2 May 2002, Bill Studenmund wrote:

> I've been playing with some iSCSI code, and noticed a very odd behavior
> with a userland test program. It performs a series of iSCSI pings (NOP-OUT
> with optional NOP-IN echo). They all fly along (less than like .2 seconds
> for 1000 iterations), until the one test case where the iSCSI target is
> echoing 4k of data back. That test takes 200 seconds.
>
> I looked into it, and the problem is that, for that one test case, the
> target is sending data back in two writes. The first is a 48-byte iSCSI
> PDU, the other is the 4k of test data. For some reason, the target waits
> for an ack from the initiator before sending the 4k response. That ack
> takes .199 seconds to arrive, thus adding the delay.
>
> What I really don't get is that both sides are doing the same write
> sequence (48 bytes, 4k), with the same tcp options (TCP_NODELAY), running
> on the same machine (using localhost), but only one side of it is having
> to delay.

It turns out that our code doesn't copy TCP options from the listening
socket to the connected socket. As mentioned in the PR, this seems like a
bug to me.
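
For anyone who wants to see the current behavior for themselves, something
along these lines should do it (a rough sketch, error checking omitted;
this is not the actual test program from the PR). Without the patch,
TCP_NODELAY comes back as 0 on the accepted socket even though it was set
on the listener:

/*
 * Sketch: does TCP_NODELAY survive accept()?
 * Loopback only, error handling omitted for brevity.
 */
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <arpa/inet.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
	struct sockaddr_in sin;
	socklen_t len = sizeof(sin);
	int lsock, csock, asock, on = 1, val = 0;
	socklen_t optlen = sizeof(val);

	lsock = socket(AF_INET, SOCK_STREAM, 0);
	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	sin.sin_port = 0;		/* let the kernel pick a port */
	bind(lsock, (struct sockaddr *)&sin, sizeof(sin));
	getsockname(lsock, (struct sockaddr *)&sin, &len);

	/* Set TCP_NODELAY on the *listening* socket... */
	setsockopt(lsock, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on));
	listen(lsock, 1);

	csock = socket(AF_INET, SOCK_STREAM, 0);
	connect(csock, (struct sockaddr *)&sin, sizeof(sin));
	asock = accept(lsock, NULL, NULL);

	/* ...and see whether the accepted socket inherited it. */
	getsockopt(asock, IPPROTO_TCP, TCP_NODELAY, &val, &optlen);
	printf("TCP_NODELAY on accepted socket: %d\n", val);

	close(asock);
	close(csock);
	close(lsock);
	return 0;
}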

I have a patch which corrects the behavior:
Index: tcp_input.c
===================================================================
RCS file: /cvsroot/syssrc/sys/netinet/tcp_input.c,v
retrieving revision 1.122.2.8
diff -u -r1.122.2.8 tcp_input.c
--- tcp_input.c	2002/06/20 03:48:54	1.122.2.8
+++ tcp_input.c	2002/07/17 20:11:55
@@ -3238,6 +3238,7 @@
 #endif
 	else
 		tp = NULL;
+	tp->t_flags = sototcpcb(oso)->t_flags & TF_NODELAY;
 	if (sc->sc_request_r_scale != 15) {
 		tp->requested_s_scale = sc->sc_requested_s_scale;
 		tp->request_r_scale = sc->sc_request_r_scale;

***

I only copy over TF_NODELAY, as it's the only user-settable TCP flag.

This does change existing behavior: before, if you set TCP_NODELAY on a
listening socket, you got connected sockets without it; now you'll get
connected sockets with it.

I don't think we need to worry about this (in this particular case), as I
can't think of a program that would set TCP_NODELAY on a listening socket
and expect it not to be set on the connected ones. BSD wisdom was that you
set TCP_NODELAY after the accept; Linux wisdom (one of the things I think
Linux actually did right) is that you set TCP_NODELAY on the listening
socket so that it gets set on all of the connected ones.
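
To make the two idioms concrete, the difference in application code is
roughly this (fragment only, error checking omitted; listenfd is assumed
to be a bound listening socket):

	int on = 1;

	/* BSD idiom: set the option per connection, after accept(). */
	int newfd = accept(listenfd, NULL, NULL);
	setsockopt(newfd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on));

	/*
	 * Linux-style idiom: set it once on the listening socket and let
	 * every accepted socket inherit it (which is what the patch above
	 * enables).
	 */
	setsockopt(listenfd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on));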

I think the change should be documented, but I'm not sure where. It would
seem weird to discuss TCP_NODELAY on the accept(2) man page, but anywhere
else might leave it a bit buried.

Thoughts?

Take care,

Bill