NetBSD-Bugs archive


kern/54645: Netbooting over a direct cable connection is unreliable

>Number:         54645
>Category:       kern
>Synopsis:       Netbooting over a direct cable connection is unreliable
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Wed Oct 23 16:25:00 +0000 2019
>Originator:     Andreas Gustafsson
>Release:        NetBSD 9.0_BETA

System: NetBSD
Architecture: x86_64
Machine: amd64

I have an automated testing setup that netboots an INSTALL kernel and
performs a scripted installation of NetBSD/amd64 on physical hardware.
It consists of two HP DL360 G7 machines, one acting as the netboot
server and the other as the client.  Both have quad bnx(4) gigabit
Ethernet interfaces.

To avoid any risk of accidentally auto-installing on the wrong
machine, the network connecting the server and client is physically
separate and used to simply consist of a short patch cable directly
connecting a port on the server to a port on the client (not even
using a crossover cable, but rather relying on "Auto MDI-X").

This worked fairly reliably until recently, when a significant
fraction of the test runs started failing.  This may or may not have
been triggered by upgrading the server from NetBSD 8 to 9.0_BETA.

Examining the network traffic using tcpdump, it looks like the server
is receiving packets from the client at about the expected times, but
not with the expected contents; rather, the packets received look like
duplicates of packets the server has received earlier.  For example,
in one case, the first packet received by the server after the client
was powered on was not the expected BOOTP packet, but a TCP packet to
port 80 (without a SYN flag, so from an existing connection).
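
The duplicate-packet pattern described above could be checked for
mechanically in a capture file.  The following is a minimal sketch,
not the procedure actually used for this report.  It assumes the
capture is in the classic pcap format with microsecond timestamps
(not pcapng), and the function names read_pcap_records and
find_repeated_frames are hypothetical.  It reports every frame that is
byte-for-byte identical to an earlier frame in the same capture.

```python
import hashlib
import struct


def read_pcap_records(data):
    """Yield (timestamp_seconds, frame_bytes) from classic pcap file contents."""
    magic = data[:4]
    if magic == b"\xd4\xc3\xb2\xa1":
        endian = "<"  # file written by a little-endian host
    elif magic == b"\xa1\xb2\xc3\xd4":
        endian = ">"  # file written by a big-endian host
    else:
        raise ValueError("not a classic microsecond-resolution pcap file")
    offset = 24  # skip the 24-byte global header
    while offset + 16 <= len(data):
        # Per-record header: seconds, microseconds, captured length, wire length
        ts_sec, ts_usec, incl_len, _orig_len = struct.unpack_from(
            endian + "IIII", data, offset)
        offset += 16
        frame = data[offset:offset + incl_len]
        offset += incl_len
        yield ts_sec + ts_usec / 1e6, frame


def find_repeated_frames(data):
    """Return (index, first_index, seconds_since_first) for each frame
    that is identical to a frame seen earlier in the capture."""
    seen = {}      # frame digest -> (index, timestamp) of first occurrence
    repeats = []
    for index, (ts, frame) in enumerate(read_pcap_records(data)):
        digest = hashlib.sha256(frame).digest()
        if digest in seen:
            first_index, first_ts = seen[digest]
            repeats.append((index, first_index, ts - first_ts))
        else:
            seen[digest] = (index, ts)
    return repeats
```

It could be run as find_repeated_frames(open("capture.pcap",
"rb").read()), where capture.pcap is a placeholder file name.  A hit is
only a candidate: periodic broadcasts such as ARP requests can
legitimately be identical, so each reported frame still needs to be
inspected by hand.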

Since the client can't actually have sent that immediately after
power-on, I suspect this is a server-side issue where the server
somehow gets confused after losing and regaining carrier on the
interface, causing it to inject packets from the wrong part of its
receive ring buffer into the network stack.

That the problem is triggered by a temporary loss of carrier is
supported by the fact that the problem has not recurred after
I added an Ethernet switch between the two machines.

I can make pcap files available on request.


