Yes, I think it is somehow related to the new ARP code in nd.c.

New data point: reversing the transfer, so that DOM0 sends to the guest, survives longer. DOM0 is fine. On the guest, the EHOSTDOWN return at nd.c:~390 is triggered often. The connection survives because a failure to send an ACK does not terminate a TCP connection. The timing pattern seems to be a mixture of ~200 ms gaps (possibly ACK re-sends) and ~41 second gaps (possibly the nd.c effect).
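For anyone who wants to check the timing on their own logs, here is a rough sketch that computes the gaps between consecutive EHOSTDOWN entries. The sample timestamps are the first seven from the guest's kernel log in this report; the function names and the 1 second threshold used to separate "within a burst" from "between bursts" are my own assumptions, chosen only for illustration.

```python
import re

# Matches kernel log lines such as
# "[ 2174.429719] /src/.../sys/net/nd.c:391: EHOSTDOWN"
LINE_RE = re.compile(r"\[\s*(\d+\.\d+)\]\s+\S+:\s+EHOSTDOWN")


def ehostdown_gaps(log_text):
    """Return the gaps (seconds) between consecutive EHOSTDOWN entries."""
    stamps = [float(m.group(1)) for m in LINE_RE.finditer(log_text)]
    return [round(b - a, 6) for a, b in zip(stamps, stamps[1:])]


def split_gaps(gaps, threshold=1.0):
    """Split gaps into short ones (inside a burst) and long ones (between bursts).

    The threshold is an assumed cut-off, not something taken from nd.c.
    """
    short = [g for g in gaps if g < threshold]
    long_ = [g for g in gaps if g >= threshold]
    return short, long_


# First seven entries copied from the guest's log.
SAMPLE = """\
[ 2174.429719] /src/NetBSD/999100/src/sys/net/nd.c:391: EHOSTDOWN
[ 2174.639722] /src/NetBSD/999100/src/sys/net/nd.c:391: EHOSTDOWN
[ 2217.636882] /src/NetBSD/999100/src/sys/net/nd.c:391: EHOSTDOWN
[ 2217.841108] /src/NetBSD/999100/src/sys/net/nd.c:391: EHOSTDOWN
[ 2218.051081] /src/NetBSD/999100/src/sys/net/nd.c:391: EHOSTDOWN
[ 2218.261120] /src/NetBSD/999100/src/sys/net/nd.c:391: EHOSTDOWN
[ 2259.258196] /src/NetBSD/999100/src/sys/net/nd.c:391: EHOSTDOWN
"""

gaps = ehostdown_gaps(SAMPLE)
short, long_ = split_gaps(gaps)
print("gaps within a burst:", short)
print("gaps between bursts:", long_)
```

Feeding the full dmesg output to ehostdown_gaps() instead of SAMPLE should show whether the two groups stay as cleanly separated over a longer run.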
The EHOSTDOWN pattern looks like this:

[ 2174.429719] /src/NetBSD/999100/src/sys/net/nd.c:391: EHOSTDOWN
[ 2174.639722] /src/NetBSD/999100/src/sys/net/nd.c:391: EHOSTDOWN
[ 2217.636882] /src/NetBSD/999100/src/sys/net/nd.c:391: EHOSTDOWN
[ 2217.841108] /src/NetBSD/999100/src/sys/net/nd.c:391: EHOSTDOWN
[ 2218.051081] /src/NetBSD/999100/src/sys/net/nd.c:391: EHOSTDOWN
[ 2218.261120] /src/NetBSD/999100/src/sys/net/nd.c:391: EHOSTDOWN
[ 2259.258196] /src/NetBSD/999100/src/sys/net/nd.c:391: EHOSTDOWN
[ 2259.462455] /src/NetBSD/999100/src/sys/net/nd.c:391: EHOSTDOWN
[ 2259.672445] /src/NetBSD/999100/src/sys/net/nd.c:391: EHOSTDOWN
[ 2300.669503] /src/NetBSD/999100/src/sys/net/nd.c:391: EHOSTDOWN
[ 2300.873729] /src/NetBSD/999100/src/sys/net/nd.c:391: EHOSTDOWN
[ 2301.083775] /src/NetBSD/999100/src/sys/net/nd.c:391: EHOSTDOWN
[ 2301.293752] /src/NetBSD/999100/src/sys/net/nd.c:391: EHOSTDOWN
[ 2342.290807] /src/NetBSD/999100/src/sys/net/nd.c:391: EHOSTDOWN
[ 2342.495060] /src/NetBSD/999100/src/sys/net/nd.c:391: EHOSTDOWN
[ 2342.705039] /src/NetBSD/999100/src/sys/net/nd.c:391: EHOSTDOWN
[ 2342.915041] /src/NetBSD/999100/src/sys/net/nd.c:391: EHOSTDOWN
[ 2383.912120] /src/NetBSD/999100/src/sys/net/nd.c:391: EHOSTDOWN
[ 2384.116365] /src/NetBSD/999100/src/sys/net/nd.c:391: EHOSTDOWN
[ 2384.326380] /src/NetBSD/999100/src/sys/net/nd.c:391: EHOSTDOWN
[ 2384.536361] /src/NetBSD/999100/src/sys/net/nd.c:391: EHOSTDOWN
[ 2427.533491] /src/NetBSD/999100/src/sys/net/nd.c:391: EHOSTDOWN
[ 2427.737675] /src/NetBSD/999100/src/sys/net/nd.c:391: EHOSTDOWN
[ 2427.947745] /src/NetBSD/999100/src/sys/net/nd.c:391: EHOSTDOWN
[ 2428.157763] /src/NetBSD/999100/src/sys/net/nd.c:391: EHOSTDOWN
[ 2469.154813] /src/NetBSD/999100/src/sys/net/nd.c:391: EHOSTDOWN
[ 2469.358585] /src/NetBSD/999100/src/sys/net/nd.c:391: EHOSTDOWN
[ 2469.568578] /src/NetBSD/999100/src/sys/net/nd.c:391: EHOSTDOWN

Best regards,
  Frank

On 10/07/22 17:10, Manuel Bouyer wrote:
The following reply was made to PR kern/57049; it has been noted by GNATS.

From: Manuel Bouyer <bouyer%antioche.eu.org@localhost>
To: Frank Kardel <kardel%netbsd.org@localhost>
Cc: gnats-bugs%netbsd.org@localhost, kern-bug-people%netbsd.org@localhost,
    gnats-admin%netbsd.org@localhost, netbsd-bugs%netbsd.org@localhost
Subject: Re: kern/57049: large TCP transfers NetBSD-Xen-Guest -> NetBSD-Xen-DOM0 abort with EHOSTDOWN
Date: Fri, 7 Oct 2022 17:07:37 +0200

On Fri, Oct 07, 2022 at 04:43:52PM +0200, Frank Kardel wrote:
> Hi Manuel,
>
> that is probably because the DOMU is 9.2 which still had the classic ARP
> resolution code. In 9.99.x the ARP resolution was replaced with a
> neighbour discovery derived code in nd.c. On Xen I tripped over this
> issue with a 99.100 GENERIC guest quickly. It may be that it happens
> with other true network peers also, but I was not able to trigger it
> with a true network peer right away.

OK, with a HEAD domU I can reproduce this.
But I don't think this is Xen-specific. Maybe it's just some timing or
resource issue that makes it more likely to happen on Xen.

--
Manuel Bouyer <bouyer%antioche.eu.org@localhost>
     NetBSD: 26 ans d'experience feront toujours la difference
--