NetBSD-Bugs archive


kern/58744: two "softints stuck" crashes; softint threads look broken



>Number:         58744
>Category:       kern
>Synopsis:       two "softints stuck" crashes; softint threads look broken
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sat Oct 12 01:15:01 +0000 2024
>Originator:     matthew green
>Release:        NetBSD 10.99.12
>Organization:
people's front against (bozotic) www (softwar foundation)
>Environment:
System:  NetBSD aches.eterna23.net 10.99.12 NetBSD 10.99.12 (_aches_) #52: Sat Sep 14 15:08:46 CDT 2024  mrg%aches.eterna23.net@localhost:/var/obj/amd64-x86_64/usr/src/sys/arch/amd64/compile/_aches_ amd64
Architecture: amd64
>Description:

	a ryzen 5600G system had two recent crashes with the softint
	heartbeat check firing at 301 seconds (the timeout had already
	been pushed out because of an earlier bug in heartbeat; i had
	forgotten about that.)

	both times the system was doing minimal CPU work, but it was
	doing 20-50MB/sec over the network and to/from nvme.

	here's a crash(8) session with them both:

aches /var/crash> crash -N netbsd.gdb -M netbsd.10.core
Crash version 10.99.12, image version 10.99.12.
crash: _kvm_kvatop(0)
Kernel compiled without options LOCKDEBUG.
System panicked: cpu0: softints stuck for 301 seconds
Backtrace from time of crash is available.
crash> bt
end() at 0
kern_reboot() at kern_reboot+0x93
vpanic() at vpanic+0x17b
panic() at printf_nostamp
heartbeat() at heartbeat+0x34c
hardclock() at hardclock+0x8b
Xresume_lapic_ltimer() at Xresume_lapic_ltimer+0x1e
--- interrupt ---
bus_space_read_stream_2() at bus_space_read_stream_2+0xb
intr_wrapper() at intr_wrapper+0x4b
intr_biglock_wrapper() at intr_biglock_wrapper+0x1e
Xhandle_ioapic_edge22() at Xhandle_ioapic_edge22+0x75
--- interrupt ---
Xspllower() at Xspllower+0xe
softint_dispatch() at softint_dispatch+0x112
DDB lost frame for Xsoftintr+0x4c, trying 0xffffa310941220f0
Xsoftintr() at Xsoftintr+0x4c
--- interrupt ---
6ad574a4669574a0:
crash> ps|grep '>'
5934 > 5934 7   3   8020100   ffff85785c9ff400               tcsh
16353>10210 7   4   8020000   ffff856d79dfd400                dae
289  >  289 7   5   8020100   ffff856d55ecec00               tcsh
10093>10093 7  11   8020100   ffff856d5213c000               tcsh
0    >  411 7   1       240   ffff856d41887c00            ioflush
0    >  406 7   2       200   ffff856d41050800            raidio2
0    >  208 1  10       201   ffff856d3f1a6c00            idle/10
0    >  202 1   9       201   ffff856d3f117400             idle/9
0    >  196 1   8       201   ffff856d3f05cc00             idle/8
0    >  124 1   7       201   ffff856d3efdd400             idle/7
0    >  122 7   6       200   ffff856d3efa1c00          softser/6
0    >  121 7   6       200   ffff856d3efa1800          softclk/6
0    >  118 1   6       201   ffff856d3ef12c00             idle/6
0    >    6 7   0       200   ffff857c349dc000          softser/0
0    >    2 1   0       201   ffff857c34a1b000             idle/0
crash> bt/a ffff857c349dc000
trace: pid 0 lid 6 at 0xffffa310941220e0
softint_dispatch() at softint_dispatch+0x112
DDB lost frame for Xsoftintr+0x4c, trying 0xffffa310941220f0
Xsoftintr() at Xsoftintr+0x4c
--- interrupt ---
6ad574a4669574a0:
crash> bt/a ffff857c34a1b000
trace: pid 0 lid 2 at 0xffffa31093cbafc0
acpicpu_cstate_idle() at acpicpu_cstate_idle+0x19a
idle_loop() at idle_loop+0x128
crash> bt/a ffff856d3efa1c00
trace: pid 0 lid 122 at 0xffffa310943340e0
softint_dispatch() at softint_dispatch+0x3b2
DDB lost frame for Xsoftintr+0x4c, trying 0xffffa310943340f0
Xsoftintr() at Xsoftintr+0x4c
--- interrupt ---
eff43a56e3b43a52:
crash> bt/a ffff856d3efa1800
trace: pid 0 lid 121 at 0xffffa3109432d0e0
softint_dispatch() at softint_dispatch+0x112
DDB lost frame for Xsoftintr+0x4c, trying 0xffffa3109432d0f0
Xsoftintr() at Xsoftintr+0x4c
--- interrupt ---
c65a6020ca1a6024:

aches /var/crash> crash -N netbsd.gdb -M netbsd.11.core
Crash version 10.99.12, image version 10.99.12.
crash: _kvm_kvatop(0)
Kernel compiled without options LOCKDEBUG.
System panicked: cpu0: softints stuck for 301 seconds
Backtrace from time of crash is available.
crash> bt
end() at 0
kern_reboot() at kern_reboot+0x93
vpanic() at vpanic+0x17b
panic() at printf_nostamp
heartbeat() at heartbeat+0x34c
hardclock() at hardclock+0x8b
Xresume_lapic_ltimer() at Xresume_lapic_ltimer+0x1e
--- interrupt ---
bus_space_read_stream_2() at bus_space_read_stream_2+0xb
intr_wrapper() at intr_wrapper+0x4b
intr_biglock_wrapper() at intr_biglock_wrapper+0x1e
Xhandle_ioapic_edge22() at Xhandle_ioapic_edge22+0x75
--- interrupt ---
Xspllower() at Xspllower+0xe
softint_dispatch() at softint_dispatch+0x3b2
DDB lost frame for Xsoftintr+0x4c, trying 0xffff8d109410d0f0
Xsoftintr() at Xsoftintr+0x4c
--- interrupt ---
0:
crash> ps|grep '>'
991  >  991 7   7   8020000   ffff8822170ef000          minidlnad
2124 > 2124 7   8   8020100   ffff88166660bc00             screen
0    >  412 7   3       240   ffff881667077c00            ioflush
0    >  406 7   4       200   ffff881666850800            raidio2
0    >  404 7   2       200   ffff881666850000            raidio6
0    >  178 7  10       240   ffff881666600400               usb4
0    >  176 7   5       240   ffff881666594c00               usb2
0    >  175 7   9       240   ffff881666594800               usb1
0    >  174 7  11       240   ffff881666594400               usb0
0    >  118 1   6       201   ffff881664712c00             idle/6
0    >   27 7   1       200   ffff88166445d400          softser/1
0    >   26 7   1       200   ffff88166445d000          softclk/1
0    >   23 1   1       201   ffff88255214a400             idle/1
0    >    3 7   0       200   ffff88255a21b400          softnet/0
0    >    2 1   0       201   ffff88255a21b000             idle/0
crash> bt/a ffff88255a21b000
trace: pid 0 lid 2 at 0xffff8d1093cbafc0
acpicpu_cstate_idle() at acpicpu_cstate_idle+0x19a
idle_loop() at idle_loop+0x128
crash> bt/a ffff88255a21b400
trace: pid 0 lid 3 at 0xffff8d109410d0e0
softint_dispatch() at softint_dispatch+0x3b2
DDB lost frame for Xsoftintr+0x4c, trying 0xffff8d109410d0f0
Xsoftintr() at Xsoftintr+0x4c
--- interrupt ---
0:
crash> bt/a ffff88166445d400
trace: pid 0 lid 27 at 0xffff8d1093b6a000
_KERNEL_OPT_PMS_DISABLE_POWERHOOK() at ffff88255212e280
crash> bt/a ffff88166445d000
trace: pid 0 lid 26 at 0xffff8d10941fb0e0
softint_dispatch() at softint_dispatch+0x112
DDB lost frame for Xsoftintr+0x4c, trying 0xffff8d10941fb0f0
Xsoftintr() at Xsoftintr+0x4c
--- interrupt ---
0:

	in both cases it seems that a softint thread on CPU0 (softser
	in the first crash, softnet in the second) was fast switched
	to from the idle/0 thread and never came back.  in both cases
	that softint thread has a tiny stack whose last call is
	softint_dispatch(), so this *probably* means some softint
	handler is doing something wrong (tm).
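
	the per-CPU reading of the ps output above can be checked
	mechanically.  the following is only a rough sketch, not part
	of crash(8): it assumes the '>' marker flags LWPs that are
	currently on a CPU and that the columns are PID, LID, state,
	CPU, flags, struct lwp address and name.  the function name
	lwps_by_cpu and the SAMPLE text (a few lines copied from the
	first session) are made up for illustration.

```python
import re
from collections import defaultdict

# Assumed column layout of a `ps|grep '>'` line from crash(8):
#   PID > LID  S  CPU  FLAGS  STRUCT-LWP-ADDRESS  NAME
PS_LINE = re.compile(
    r"^(?P<pid>\d+)\s*>\s*(?P<lid>\d+)\s+(?P<stat>\d+)\s+(?P<cpu>\d+)\s+"
    r"(?P<flags>[0-9a-f]+)\s+(?P<lwp>[0-9a-f]+)\s+(?P<name>\S+)"
)


def lwps_by_cpu(ps_output):
    """Group the '>'-marked LWPs by CPU index as (name, lwp address) pairs."""
    by_cpu = defaultdict(list)
    for line in ps_output.splitlines():
        m = PS_LINE.match(line.strip())
        if m is None:
            continue  # skip anything that does not look like a ps line
        by_cpu[int(m.group("cpu"))].append((m.group("name"), m.group("lwp")))
    return dict(by_cpu)


# A few lines copied from the first crash session in this report.
SAMPLE = """\
5934 > 5934 7   3   8020100   ffff85785c9ff400               tcsh
0    >  122 7   6       200   ffff856d3efa1c00          softser/6
0    >  121 7   6       200   ffff856d3efa1800          softclk/6
0    >  118 1   6       201   ffff856d3ef12c00             idle/6
0    >    6 7   0       200   ffff857c349dc000          softser/0
0    >    2 1   0       201   ffff857c34a1b000             idle/0
"""

if __name__ == "__main__":
    for cpu, lwps in sorted(lwps_by_cpu(SAMPLE).items()):
        print(cpu, lwps)
```

	a CPU that lists both its idle thread and a softint thread is
	a candidate for the "switched to the softint and never came
	back" case; the addresses can then be fed to bt/a as in the
	sessions above.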

	i have both core files and netbsd.gdb so can investigate
	both via gdb and crash, as long as the system remains
	functional and accessible remotely.

>How-To-Repeat:
>Fix:


