port-sparc64/58311: tlp network dies after a while on sparc64

To: port-sparc64-maintainer%netbsd.org@localhost,gnats-admin%netbsd.org@localhost,netbsd-bugs%netbsd.org@localhost
Subject: port-sparc64/58311: tlp network dies after a while on sparc64
From: 2857%gmx.de@localhost
Date: Wed, 5 Jun 2024 01:35:01 +0000 (UTC)

>Number:         58311
>Category:       port-sparc64
>Synopsis:       tlp network dies after a while on sparc64
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    port-sparc64-maintainer
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Wed Jun 05 01:35:00 +0000 2024
>Originator:     zip100
>Release:        NetBSD 10.0
>Organization:
>Environment:
NetBSD sunfish 10.0_STABLE NetBSD 10.0_STABLE (GENERIC) #0: Sat May 25 00:38:27 UTC 2024  builder@netbsd:/home/builder/obj/sys/arch/sparc64/compile/GENERIC sparc64

SunFire v100, set up from an image built from the NetBSD 10.0 branch on 25 May 2024. Standard kernel without any changes.


>Description:
Everything works correctly for a while. After some time (~1 hour), the machine becomes unreachable from other hosts on the same LAN, so pings and SSH fail. At the same time, from the serial console it is still possible to ping and download from the Internet on the machine itself. Pinging LAN hosts from the machine works about 50% of the time. When the affected machine pings a host first (I guess this establishes an ARP entry), that host can then sometimes, but not always, reach it back.


sunfish is the hostname of the affected machine (192.168.33.64),
192.168.33.1 is the router, and .33 and .14 are other machines on the same subnet.


sunfish# ifconfig
tlp0: flags=0x8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
        ec_capabilities=0x1<VLAN_MTU>
        ec_enabled=0
        address: 00:03:ba:36:42:cc
        media: Ethernet autoselect (100baseTX full-duplex)
        status: active
        inet6 xxxx::xxx:xxxx:xxxx:xxxx%tlp0/64 flags 0 scopeid 0x1
        inet 192.168.33.64/24 broadcast 192.168.33.255 flags 0

sunfish# arp -na
? (192.168.33.14) at 4c:32:75:9f:62:43 on tlp0 46s W
? (192.168.33.33) at 18:c0:4d:0c:62:30 on tlp0 23h58m4s S
? (192.168.33.1) at 1a:fd:74:78:70:12 on tlp0 23h59m45s S


sunfish# ping 192.168.33.33
PING 192.168.33.33: 56 data bytes
64 bytes from 192.168.33.33: icmp_seq=0 ttl=64 time=0.696240 ms
64 bytes from 192.168.33.33: icmp_seq=1 ttl=64 time=0.308880 ms
^C
-

sunfish# ping 192.168.33.14
PING 192.168.33.14 (192.168.33.14): 56 data bytes
ping: sendto: Host is down
ping: sendto: Host is down
^C

sunfish# netstat -p icmp
icmp:
        0 calls to icmp_error
        0 errors not generated because old message was icmp
        Output histogram:
                echoreply: 131
        0 messages with bad code fields
        0 messages < minimum length
        0 bad checksums
        0 messages with bad length
        0 multicast echo requests ignored
        0 multicast timestamp requests ignored
        Input histogram:
                echoreply: 28
                echo: 131
        131 message responses generated
        0 path MTU changes

Now .33 can ping sunfish successfully, while .14 cannot (both .33 and .14 are definitely up, can ping each other, and are reachable from the router).

Another detail: .33 and .64 are connected to the same switch, while .14 is connected through the router:

.14 <---> router <---> switch : .33, .64


Running tcpdump on both .14 and sunfish reveals that the pings are recorded on both machines; they are just not processed by the ping command on .14. Even more interesting, "netstat -p icmp" on sunfish shows no errors, and the "message responses generated" counter increments on every ping. However, these replies are not registered by the originating system (.14).
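
The counter comparison described above can be scripted so that it is easy to repeat while the problem is occurring. This is only a minimal sketch, assuming the "netstat -p icmp" output layout shown earlier in this report; icmp_counters is a hypothetical helper name, not an existing tool.

```sh
# Hypothetical helper: summarise the counters of interest from
# "netstat -p icmp" output supplied on standard input.
icmp_counters() {
    awk '
        # Remember which histogram the following entries belong to.
        /Output histogram:/ { section = "out"; next }
        /Input histogram:/  { section = "in";  next }

        # Histogram entries have the form "name: count".
        NF == 2 && $1 ~ /^[a-z]+:$/ && $2 ~ /^[0-9]+$/ {
            name = $1
            sub(/:$/, "", name)
            count[section "_" name] = $2
            next
        }

        # Any other line ends the current histogram.
        { section = "" }

        /message responses generated/ { responses = $1 }

        END {
            printf "echo_in=%d echoreply_out=%d responses=%d\n",
                count["in_echo"], count["out_echoreply"], responses
        }
    '
}
```

Running "netstat -p icmp | icmp_counters" on sunfish before and after a ping attempt from .14 should show whether all three numbers advance together. If they do, sunfish believes it has answered every echo request, which would point at the transmit path or the network beyond it rather than at ICMP processing.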
>How-To-Repeat:

>Fix:
