Serious TCP performance regression in 4.0?
I have recently been made aware of what appears to be a serious
TCP bulk data transfer performance regression experienced by
NetBSD 4.0 systems. The problem appears whenever a NetBSD 4.0
system is either the sender or the receiver.
One way to reproduce the problem is to run "iperf -s" on the
receiver and "iperf -c <receiver> -i 3 -t 30" on the sender.
I've used a variety of combinations of systems, but it's also
reproducible by doing this test on two neighboring systems,
connected by either 10Mbit/s, 100Mbit/s or 1Gbit/s Ethernet, and
with the default TCP window size of 32KB. What I am seeing is
that this results in only about 3Mbit/s of throughput,
irrespective of the speed of the underlying medium(!)
The good news is that if both ends of the connection are running
code from either NetBSD 3.1 or NetBSD 5.0_BETA, this problem does
not occur, and one can get somewhat more reasonable performance;
easily around 80Mbit/s on a shared 100Mbit/s network.
I have taken tcpdumps on the sender systems in several of the
tests I have done, and put it through the tcptrace analysis
program (pkgsrc/net/tcptrace) with "tcptrace -G -l <tcpdump-file>"
and looked at the resulting TCP time-segment plots with xplot
(pkgsrc/graphics/xplot) with "xplot a2b_tsg.xpl". The following
observations stand out:
1) With NetBSD 4.0 as the sender, after the startup phase, there
are bursts of sending, filling the sender window. This is
acked quickly thereafter, opening up the sender window.
However, the sender sits idle for around 100ms before a new
burst is sent, and the pattern repeats.
2) With NetBSD 4.0 as the receiver, there are again bursts of
sending, but the receiver does not send acks to open up the
window; instead it sits there for around 100ms before the
acks opening up the window are sent. Observation of "netstat
-f inet ... | grep <portno>" reveals that the "receive queue"
column has typical values close to 32KB, which might be the
reason why the receiver isn't acking the data to open up the
window. I believe the read/write operations done by iperf are
8KB by default, so there should be plenty of data to satisfy
the outstanding socket read request.
Typically you don't get any retransmissions in the TCP sessions.
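A back-of-the-envelope check shows why this burst-then-idle
pattern caps throughput near 3Mbit/s regardless of link speed.
The ~100ms idle time and the 32KB window are taken from the
observations above; the per-link burst drain times are my own
estimates, not measurements:

```python
# Sketch: effective throughput if the sender fills a 32 KB window
# in one burst, then sits idle ~100 ms waiting for a window update.
WINDOW = 32 * 1024   # bytes; default socket buffer / TCP window
IDLE = 0.100         # seconds of observed idle time per burst

for link_mbps in (10, 100, 1000):
    burst = WINDOW * 8 / (link_mbps * 1e6)    # time to drain the window
    mbps = WINDOW * 8 / (burst + IDLE) / 1e6  # effective throughput
    print(f"{link_mbps:5d} Mbit/s link -> {mbps:.2f} Mbit/s effective")
```

All three links come out between roughly 2 and 2.6 Mbit/s: the idle
time, not the wire speed, dominates, which matches the observed
~3Mbit/s "irrespective of the speed of the underlying medium".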
A partial tcpdump of the situation with 2) follows, using the
"-ttt" option to display relative timestamps in microseconds:
000666 IP B.5005 > A.61774: . ack 76649 win 30684
000002 IP B.5005 > A.61774: . ack 79545 win 27788
000001 IP B.5005 > A.61774: . ack 82441 win 24892
000001 IP B.5005 > A.61774: . ack 85337 win 21996
000001 IP B.5005 > A.61774: . ack 88233 win 19100
000001 IP B.5005 > A.61774: . ack 91129 win 16204
000001 IP B.5005 > A.61774: . ack 94025 win 13308
000002 IP B.5005 > A.61774: . ack 96921 win 10412
000001 IP B.5005 > A.61774: . ack 99817 win 7516
000001 IP B.5005 > A.61774: . ack 102713 win 4620
000001 IP B.5005 > A.61774: . ack 105609 win 1724
000001 IP B.5005 > A.61774: . ack 105609 win 9916
000039 IP A.61774 > B.5005: P 105609:106521(912) ack 1 win 4197
000011 IP A.61774 > B.5005: . 106521:107969(1448) ack 1 win 4197
000006 IP A.61774 > B.5005: . 107969:109417(1448) ack 1 win 4197
000005 IP A.61774 > B.5005: . 109417:110865(1448) ack 1 win 4197
000002 IP A.61774 > B.5005: . 110865:112313(1448) ack 1 win 4197
000004 IP A.61774 > B.5005: . 112313:113761(1448) ack 1 win 4197
000007 IP A.61774 > B.5005: . 113761:115209(1448) ack 1 win 4197
000681 IP B.5005 > A.61774: . ack 107969 win 7556
000001 IP B.5005 > A.61774: . ack 110865 win 4660
000001 IP B.5005 > A.61774: . ack 113761 win 1764
107508 IP B.5005 > A.61774: . ack 115209 win 8508
000002 IP B.5005 > A.61774: . ack 115209 win 16700
000002 IP B.5005 > A.61774: . ack 115209 win 24892
000001 IP B.5005 > A.61774: . ack 115209 win 33084
(I've edited away " <nop,nop,timestamp 1 1>" from all of the
lines, for readability.)
As you'll see, the receiver at B waits about 107ms before opening
up the receive window, allowing the sender at A to send again.
What is it doing during those 107ms? Waiting for some internal
timer in the kernel to fire?!?
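To quantify how much that single gap dominates, one can simply sum
the relative timestamps from the excerpt above (the deltas below
are copied straight from the trace):

```python
# Relative timestamps (microseconds) from the tcpdump -ttt excerpt,
# in order: 12 acks, 7 data segments, 3 acks, then the stall and
# the window-opening acks.
deltas = [666, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1,
          39, 11, 6, 5, 2, 4, 7,
          681, 1, 1,
          107508, 2, 2, 1]

total = sum(deltas)
stall = max(deltas)
print(f"total {total} us, largest gap {stall} us "
      f"({100 * stall / total:.1f}% of the excerpt)")
# -> total 108949 us, largest gap 107508 us (98.7% of the excerpt)
```

So a single ~107ms pause accounts for nearly 99% of the elapsed
time in this excerpt; everything else completes in about 1.4ms.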
I have tried to look at the CVS history for tcp_input.c, but I
can't offhand find anything which would explain this behaviour.
Am I the only one who observes this behaviour? (I suspect not.)
Any hints as to what the actual cause might be? Perhaps the
source of the problem isn't in the TCP stack at all, but rather
in the socket handling code and under what conditions the
application program is woken up?