NetBSD-Users archive


Re: Weird network performance problem



Thanks for the good suggestions. I'll go ahead with some tcpdumps.

On Sun, 19 Jan 2020 at 15:49, Greg Troxel <gdt%lexort.com@localhost> wrote:
>
>   [lots of details]
>
> These things are somewhat tricky to debug.  There can be issues in the
> TCP stacks, issues with interfaces, and issues within the network.  I
> have a suspicion that there is something not 100% right about NetBSD's
> TCP retransmit behavior under fairly rare loss conditions, and you may
> be seeing that.  If you can reproduce this reliably we could perhaps
> figure it out.
>
> My advice is:
>
>   First figure out what's going on with the ethernet-over-powerline
>   taken out of the equation.

I have already tried to eliminate it. One of the laptops (marked B,
usually running Fedora but with W10 installed as well) is connected
directly to the same gigabit switch. When it runs Fedora, the iperf3
results to the NetBSD machine are as expected; when it runs W10, they
are about three times slower. This is described in the second part of
my original message. So there is something Windows-specific, and for
the moment I will discount the powerline adapters altogether.

>
>   It looks like you are using vlan support on Y.  Try without also.

That may be something to look at. This is my NVMM host as well; on
every boot I recreate tap[0..5] for use by the NVMM guests (but the
tests were done without any of them running).

I am not deliberately using vlans - the switch upstairs is a dumb one,
although the one downstairs is managed and has (currently unused) vlan
support. The interfaces are configured simply: /etc/ifconfig.wm0
contains just
'inet 192.168.0.29 netmask 255.255.255.0 up description "My LAN"'
and /etc/ifconfig.bridge0 contains:

create
!ifconfig tap0 create up description "LxMint"
!ifconfig tap1 create up description "MXLinux"
!ifconfig tap2 create up description "FreeBSD12"
!ifconfig tap3 create up description "NBSDc"
!ifconfig tap4 create up description "OpenBSD"
!ifconfig tap5 create up description "Windows10"
!brconfig $int add wm0
!brconfig $int add tap0
!brconfig $int add tap1
!brconfig $int add tap2
!brconfig $int add tap3
!brconfig $int add tap4
!brconfig $int add tap5
up

so whatever is the default in these conditions is used.
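
Since the bridge file is repetitive, here is a minimal sketch of how
it could be generated from a list of guest names. The function name
bridge_config and its uplink parameter are hypothetical, introduced
only for illustration; the emitted lines are meant to match the file
shown above.

```python
# Guest names are the tap descriptions from the ifconfig.bridge0 file above.
guests = ["LxMint", "MXLinux", "FreeBSD12", "NBSDc", "OpenBSD", "Windows10"]


def bridge_config(guests, uplink="wm0"):
    """Return the text of an ifconfig.bridge0-style file (hypothetical helper)."""
    lines = ["create"]
    # One tap interface per guest, numbered in list order.
    for i, name in enumerate(guests):
        lines.append(f'!ifconfig tap{i} create up description "{name}"')
    # Add the physical uplink first, then every tap, to the bridge.
    lines.append(f"!brconfig $int add {uplink}")
    for i in range(len(guests)):
        lines.append(f"!brconfig $int add tap{i}")
    lines.append("up")
    return "\n".join(lines) + "\n"


print(bridge_config(guests), end="")
```

Adding or removing a guest then only means editing the list, and the
tap numbering stays consistent between the two halves of the file.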



>
>   Do some iperf3 testing with UDP.  This should more or less separate
>   loss from TCP's behavior in response to loss.  I am unclear on how
>   iperf3 deals with this, but it seems obvious that it can tell you what
>   fraction of the UDP packets it sent ended up arriving.
>
>   [not easy but worth it] install graphics/xplot-devel.  Read the info
>   about tcp plots.  Capture the data with tcpdump at the NetBSD server
>   end (with -w to a file).  More generally, capture data at the host
>   that is slow in transmitting; this gets that host's view of the
>   arriving acks.  Process the tcpdump output with tcpdump2xplot,
>   probably having to debug and fix the perl script to account for drift
>   in tcpdump format over time.  Or perhaps use a netbsd-5 tcpdump to
>   decode.  Then, learn how to read the plots, and look at the data.
>   This will let you see what packet loss there is, and how the TCP
>   sender responds to it.
>
> I can help you offlist with the xplot stuff, as I already understand
> this (my grad school officemate's thesis project).  It's on my todo list
> to update the parsing code to cope with more modern tcpdump, which I
> hope will stop rototilling the formats.
>
> One thing you said seemed odd:
>
>   I test the network speed using iperf3 on all these boxes. The speeds
>   upstairs, where all the machines are connected to the gigabit switch,
>   are roughly consistent - I get some 930Mbps both ways (there is a bit
>   of a speed ramp-up when the server is the NetBSD laptop, but after the
>   fifth or so transfer it gets to the same rates). The speeds are also
>
> Can you explain this more precisely, and maybe post a few summary lines?
> This doesn't really make sense to me.  Any given TCP connection has to
> ramp up the congestion window, but I wouldn't expect a second one 30s
> later to benefit from the first -- but maybe there is some caching of
> RTT or something else?  After the speeds improve, how long can you wait
> before another test that is back to slower?  Going way out on a limb,
> this smells like caching of some parameters that leads to better
> handling packet loss, and the real issue is that the loss shouldn't be
> happening.

From the XCP-NG host to the NetBSD laptop:

$ iperf3 -c ymir.lorien.lan
Connecting to host ymir.lorien.lan, port 5201
[  4] local 192.168.0.5 port 36036 connected to 192.168.0.29 port 5201
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-1.00   sec  45.9 MBytes   385 Mbits/sec    0   66.5 KBytes
[  4]   1.00-2.00   sec  64.2 MBytes   539 Mbits/sec    0    100 KBytes
[  4]   2.00-3.00   sec  81.3 MBytes   682 Mbits/sec    0    132 KBytes
[  4]   3.00-4.00   sec  99.4 MBytes   834 Mbits/sec    0    163 KBytes
[  4]   4.00-5.00   sec   109 MBytes   911 Mbits/sec    0    205 KBytes
[  4]   5.00-6.00   sec   111 MBytes   928 Mbits/sec    0    205 KBytes
[  4]   6.00-7.00   sec   111 MBytes   928 Mbits/sec    0    205 KBytes
[  4]   7.00-8.00   sec   111 MBytes   932 Mbits/sec    0    205 KBytes
[  4]   8.00-9.00   sec   111 MBytes   930 Mbits/sec    0    205 KBytes
[  4]   9.00-10.00  sec   111 MBytes   932 Mbits/sec    0    205 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec   954 MBytes   800 Mbits/sec    0             sender
[  4]   0.00-10.00  sec   953 MBytes   800 Mbits/sec                  receiver

It starts a bit slower, but after the fourth interval it reaches
roughly the maximum.
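
As a quick cross-check of what the ramp-up costs, here is a minimal
sketch of the arithmetic. It assumes that iperf3 counts 1 MByte as
2**20 bytes and 1 Mbit as 10**6 bits; the per-interval rates are
copied from the run above.

```python
# Per-interval rates (Mbits/sec) copied from the iperf3 run above.
rates = [385, 539, 682, 834, 911, 928, 928, 932, 930, 932]

# Average over the whole 10-second test, ramp-up included.
mean_rate = sum(rates) / len(rates)

# Cross-check against the sender summary line: 954 MBytes in 10 s.
# Assumption: 1 MByte = 2**20 bytes, 1 Mbit = 10**6 bits.
summary_rate = 954 * 2**20 * 8 / 10 / 1e6

# Steady state: the last five intervals, after the ramp-up.
steady = rates[5:]
steady_rate = sum(steady) / len(steady)

print(f"mean of intervals: {mean_rate:.1f} Mbits/sec")
print(f"from summary line: {summary_rate:.1f} Mbits/sec")
print(f"steady state:      {steady_rate:.1f} Mbits/sec")
```

Both averages come out at about 800 Mbits/sec, against about 930
Mbits/sec in steady state, so the ramp-up costs roughly 14% of the
10-second average.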

When the server is the B laptop running W10, I get:

$ iperf3 -c brutus.lorien.lan
Connecting to host brutus.lorien.lan, port 5201
[  4] local 192.168.0.5 port 43654 connected to 192.168.0.36 port 5201
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-1.00   sec   106 MBytes   885 Mbits/sec    0    220 KBytes
[  4]   1.00-2.00   sec   108 MBytes   902 Mbits/sec    0    220 KBytes
[  4]   2.00-3.00   sec   112 MBytes   938 Mbits/sec    0    220 KBytes
[  4]   3.00-4.00   sec   111 MBytes   934 Mbits/sec    0    220 KBytes
[  4]   4.00-5.00   sec   112 MBytes   935 Mbits/sec    0    220 KBytes
[  4]   5.00-6.00   sec   112 MBytes   941 Mbits/sec    0    220 KBytes
[  4]   6.00-7.00   sec   112 MBytes   941 Mbits/sec    0    220 KBytes
[  4]   7.00-8.00   sec   109 MBytes   917 Mbits/sec    0    220 KBytes
[  4]   8.00-9.00   sec   112 MBytes   943 Mbits/sec    0    220 KBytes
[  4]   9.00-10.00  sec   112 MBytes   942 Mbits/sec    0    220 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  1.08 GBytes   928 Mbits/sec    0             sender
[  4]   0.00-10.00  sec  1.08 GBytes   928 Mbits/sec                  receiver

i.e. the speed is close to the maximum from the start.

The lack of symmetry is strange: from NetBSD to W10 the speed is full,
while from W10 to NetBSD it is about a third. At the same time there
is no significant difference if W10 is replaced by Linux or FreeBSD;
both directions are similar. And it can't be blamed on iperf3 on W10
alone: when the server is Linux or FreeBSD, the speed is as expected.
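
One way to compare the two directions from a single client is
iperf3's -R (reverse) option combined with -J (JSON output). The
sketch below is only an illustration: the helper names are
hypothetical, and the JSON path end.sum_received.bits_per_second is an
assumption about the report layout for TCP tests that should be
checked against the iperf3 version in use.

```python
import json
import subprocess


def iperf3_cmd(server, reverse=False):
    """Build an iperf3 client command line (hypothetical helper)."""
    cmd = ["iperf3", "-c", server, "-J"]  # -J: JSON report
    if reverse:
        cmd.append("-R")  # -R: the server sends, the client receives
    return cmd


def received_mbits(json_text):
    """Extract the receiver-side rate in Mbits/sec from a JSON report.

    Assumption: TCP reports carry end.sum_received.bits_per_second.
    """
    report = json.loads(json_text)
    return report["end"]["sum_received"]["bits_per_second"] / 1e6


def measure(server, reverse=False):
    """Run one test and return the received rate in Mbits/sec."""
    result = subprocess.run(
        iperf3_cmd(server, reverse),
        capture_output=True, text=True, check=True,
    )
    return received_mbits(result.stdout)
```

Calling measure("ymir.lorien.lan") and then
measure("ymir.lorien.lan", reverse=True) from the same client would
give one number per direction, which makes an asymmetry like the one
above easy to record for each OS pair.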









