tech-net archive


Serious TCP performance regression in 4.0?


I have recently been made aware of what appears to be a serious
TCP bulk data transfer performance regression experienced by
NetBSD 4.0.

The problem occurs whenever a NetBSD 4.0 system is either the
sender or the receiver.

One way to reproduce the problem is to run "iperf -s" on the
receiver and "iperf -c <receiver> -i 3 -t 30" on the sender.
I've used a variety of combinations of systems, but it's also
reproducible by running this test between two neighboring
systems, connected by either 10Mbit/s, 100Mbit/s or 1Gbit/s
Ethernet, and with the default 32KB TCP window size.  What I am
seeing is that this results in only about 3Mbit/s of throughput,
irrespective of the speed of the underlying medium(!)
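For reference, the 32KB window by itself should not cap
throughput anywhere near 3Mbit/s on a LAN.  A rough window/RTT
calculation (the ~1ms RTT here is my assumption of a typical
LAN round-trip time, not a measurement from these tests) gives a
much higher ceiling:

```python
# Back-of-the-envelope: window-limited TCP throughput is roughly
# window_size / RTT.  The 1 ms RTT is an assumed typical LAN
# value, not a measured figure from the tests described above.

window_bytes = 32 * 1024     # default 32KB socket buffer
rtt_s = 0.001                # assumed ~1 ms LAN round-trip time

ceiling_bits_per_s = window_bytes * 8 / rtt_s
print(f"{ceiling_bits_per_s / 1e6:.0f} Mbit/s")   # → 262 Mbit/s
```

So the window alone can't explain the observed rate; something
else must be stalling the connection.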

The good news is that if both ends of the connection are running
code from either NetBSD 3.1 or NetBSD 5.0_BETA, this problem does
not occur, and one can get somewhat more reasonable performance;
easily around 80Mbit/s on a shared 100Mbit/s network.

I have taken tcpdumps on the sender systems in several of the
tests I have done, run them through the tcptrace analysis
program (pkgsrc/net/tcptrace) with "tcptrace -G -l
<tcpdump-file>", and looked at the resulting TCP time-sequence
plots with xplot (pkgsrc/graphics/xplot) using "xplot
a2b_tsg.xpl".  The following pattern emerges:

1) With NetBSD 4.0 as the sender, after the startup phase, there
   are bursts of sending, filling the sender window.  This is
   acked quickly thereafter, opening up the sender window.
   However, the sender sits idle for around 100ms before a new
   burst is sent, and the pattern repeats.

2) With NetBSD 4.0 as the receiver, there are again bursts of
   sending, but the receiver does not send acks to open up the
   window; instead it sits there for around 100ms before the
   acks opening up the window are sent.  Observations of "netstat
   -f inet ... | grep <portno>" reveal that the "receive queue"
   column typically shows values close to 32KB, which might be
   the reason why the receiver isn't acking the data to open up
   the window.  I believe the read/write operations done by iperf
   are 8KB by default, so there should be plenty of data to
   satisfy the outstanding socket read request.

There are typically no retransmissions in these TCP sessions.
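The ~3Mbit/s figure is consistent with the stalls described
above: if each cycle moves one 32KB window and then idles for
roughly 100ms before the next burst, the achievable rate works
out to a few Mbit/s.  A quick sketch (the 100ms cycle time is
taken from the observations above, rounded):

```python
# If each ~100 ms cycle transfers one 32KB window and then stalls,
# the effective throughput is window / cycle_time.

window_bytes = 32 * 1024
cycle_s = 0.100              # ~100 ms stall observed per burst

rate_bits_per_s = window_bytes * 8 / cycle_s
print(f"{rate_bits_per_s / 1e6:.2f} Mbit/s")   # → 2.62 Mbit/s
```

That matches the ~3Mbit/s observed with iperf, which suggests the
stall, not the medium or the window size, is the bottleneck.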

A partial tcpdump of the situation with 2) follows, using the
"-ttt" option to display relative timestamps in microseconds:

   000666 IP B.5005 > A.61774: . ack 76649 win 30684
   000002 IP B.5005 > A.61774: . ack 79545 win 27788
   000001 IP B.5005 > A.61774: . ack 82441 win 24892
   000001 IP B.5005 > A.61774: . ack 85337 win 21996
   000001 IP B.5005 > A.61774: . ack 88233 win 19100
   000001 IP B.5005 > A.61774: . ack 91129 win 16204
   000001 IP B.5005 > A.61774: . ack 94025 win 13308
   000002 IP B.5005 > A.61774: . ack 96921 win 10412
   000001 IP B.5005 > A.61774: . ack 99817 win 7516
   000001 IP B.5005 > A.61774: . ack 102713 win 4620
   000001 IP B.5005 > A.61774: . ack 105609 win 1724
   000001 IP B.5005 > A.61774: . ack 105609 win 9916
   000039 IP A.61774 > B.5005: P 105609:106521(912) ack 1 win 4197
   000011 IP A.61774 > B.5005: . 106521:107969(1448) ack 1 win 4197
   000006 IP A.61774 > B.5005: . 107969:109417(1448) ack 1 win 4197
   000005 IP A.61774 > B.5005: . 109417:110865(1448) ack 1 win 4197
   000002 IP A.61774 > B.5005: . 110865:112313(1448) ack 1 win 4197
   000004 IP A.61774 > B.5005: . 112313:113761(1448) ack 1 win 4197
   000007 IP A.61774 > B.5005: . 113761:115209(1448) ack 1 win 4197
   000681 IP B.5005 > A.61774: . ack 107969 win 7556
   000001 IP B.5005 > A.61774: . ack 110865 win 4660
   000001 IP B.5005 > A.61774: . ack 113761 win 1764
   107508 IP B.5005 > A.61774: . ack 115209 win 8508
   000002 IP B.5005 > A.61774: . ack 115209 win 16700
   000002 IP B.5005 > A.61774: . ack 115209 win 24892
   000001 IP B.5005 > A.61774: . ack 115209 win 33084

(I've edited away " <nop,nop,timestamp 1 1>" from all of the
lines, for readability.)
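For anyone wanting to locate these stalls in their own captures,
a small sketch that scans "tcpdump -ttt" output for inter-packet
gaps above a threshold (the trace lines below are an excerpt from
the dump above; the 50ms threshold is an arbitrary choice of
mine):

```python
# Scan "tcpdump -ttt" output (first field = microseconds since the
# previous packet) for inter-packet gaps above a threshold, to
# locate stalls like the ~107 ms one in the trace above.

TRACE = """\
000001 IP B.5005 > A.61774: . ack 105609 win 1724
000001 IP B.5005 > A.61774: . ack 105609 win 9916
000039 IP A.61774 > B.5005: P 105609:106521(912) ack 1 win 4197
107508 IP B.5005 > A.61774: . ack 115209 win 8508
"""

def find_stalls(lines, threshold_us=50_000):
    """Return (delta_us, line) pairs whose relative timestamp
    exceeds threshold_us microseconds."""
    stalls = []
    for line in lines:
        fields = line.split(None, 1)
        if not fields or not fields[0].isdigit():
            continue
        delta = int(fields[0])
        if delta > threshold_us:
            stalls.append((delta, line.rstrip()))
    return stalls

for delta, line in find_stalls(TRACE.splitlines()):
    print(f"{delta / 1000:.1f} ms gap before: {line}")
    # → 107.5 ms gap before: 107508 IP B.5005 > A.61774: ...
```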

As you'll see, the receiver at B waits about 107ms before opening
up the receive window, allowing the sender at A to send again.
What is it doing during those 107ms?  Waiting for some internal
timer in the kernel to fire?!?

I have tried to look at the CVS history for tcp_input.c, but I
can't offhand find anything which would explain this behaviour.

Am I the only one who observes this behaviour? (I suspect not.)

Any hints as to what the actual cause might be?  Perhaps the
source of the problem isn't in the TCP stack at all, but rather
in the socket handling code and under what conditions the
application program is woken up?

Best regards,

- Håvard
