NetBSD-Bugs archive
kern/58744: two softint stuck crashes softint threads look broken
>Number: 58744
>Category: kern
>Synopsis: two softint stuck crashes softint threads look broken
>Confidential: no
>Severity: critical
>Priority: high
>Responsible: kern-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Sat Oct 12 01:15:01 +0000 2024
>Originator: matthew green
>Release: NetBSD 10.99.12
>Organization:
people's front against (bozotic) www (softwar foundation)
>Environment:
System: NetBSD aches.eterna23.net 10.99.12 NetBSD 10.99.12 (_aches_) #52: Sat Sep 14 15:08:46 CDT 2024 mrg%aches.eterna23.net@localhost:/var/obj/amd64-x86_64/usr/src/sys/arch/amd64/compile/_aches_ amd64
Architecture: amd64
>Description:
a ryzen 5600G system had two recent crashes with the softint
heartbeat firing at 301 seconds (the timeout was already pushed
out because of an earlier bug in heartbeat; forgot about that.)
both times the system was doing minimal CPU work, but it was
doing 20-50MB/sec over the network and to/from nvme.
here's a crash(8) session with them both:
aches /var/crash> crash -N netbsd.gdb -M netbsd.10.core
Crash version 10.99.12, image version 10.99.12.
crash: _kvm_kvatop(0)
Kernel compiled without options LOCKDEBUG.
System panicked: cpu0: softints stuck for 301 seconds
Backtrace from time of crash is available.
crash> bt
end() at 0
kern_reboot() at kern_reboot+0x93
vpanic() at vpanic+0x17b
panic() at printf_nostamp
heartbeat() at heartbeat+0x34c
hardclock() at hardclock+0x8b
Xresume_lapic_ltimer() at Xresume_lapic_ltimer+0x1e
--- interrupt ---
bus_space_read_stream_2() at bus_space_read_stream_2+0xb
intr_wrapper() at intr_wrapper+0x4b
intr_biglock_wrapper() at intr_biglock_wrapper+0x1e
Xhandle_ioapic_edge22() at Xhandle_ioapic_edge22+0x75
--- interrupt ---
Xspllower() at Xspllower+0xe
softint_dispatch() at softint_dispatch+0x112
DDB lost frame for Xsoftintr+0x4c, trying 0xffffa310941220f0
Xsoftintr() at Xsoftintr+0x4c
--- interrupt ---
6ad574a4669574a0:
crash> ps|grep '>'
5934 > 5934 7 3 8020100 ffff85785c9ff400 tcsh
16353>10210 7 4 8020000 ffff856d79dfd400 dae
289 > 289 7 5 8020100 ffff856d55ecec00 tcsh
10093>10093 7 11 8020100 ffff856d5213c000 tcsh
0 > 411 7 1 240 ffff856d41887c00 ioflush
0 > 406 7 2 200 ffff856d41050800 raidio2
0 > 208 1 10 201 ffff856d3f1a6c00 idle/10
0 > 202 1 9 201 ffff856d3f117400 idle/9
0 > 196 1 8 201 ffff856d3f05cc00 idle/8
0 > 124 1 7 201 ffff856d3efdd400 idle/7
0 > 122 7 6 200 ffff856d3efa1c00 softser/6
0 > 121 7 6 200 ffff856d3efa1800 softclk/6
0 > 118 1 6 201 ffff856d3ef12c00 idle/6
0 > 6 7 0 200 ffff857c349dc000 softser/0
0 > 2 1 0 201 ffff857c34a1b000 idle/0
crash> bt/a ffff857c349dc000
trace: pid 0 lid 6 at 0xffffa310941220e0
softint_dispatch() at softint_dispatch+0x112
DDB lost frame for Xsoftintr+0x4c, trying 0xffffa310941220f0
Xsoftintr() at Xsoftintr+0x4c
--- interrupt ---
6ad574a4669574a0:
crash> bt/a ffff857c34a1b000
trace: pid 0 lid 2 at 0xffffa31093cbafc0
acpicpu_cstate_idle() at acpicpu_cstate_idle+0x19a
idle_loop() at idle_loop+0x128
crash> bt/a ffff856d3efa1c00
trace: pid 0 lid 122 at 0xffffa310943340e0
softint_dispatch() at softint_dispatch+0x3b2
DDB lost frame for Xsoftintr+0x4c, trying 0xffffa310943340f0
Xsoftintr() at Xsoftintr+0x4c
--- interrupt ---
eff43a56e3b43a52:
crash> bt/a ffff856d3efa1800
trace: pid 0 lid 121 at 0xffffa3109432d0e0
softint_dispatch() at softint_dispatch+0x112
DDB lost frame for Xsoftintr+0x4c, trying 0xffffa3109432d0f0
Xsoftintr() at Xsoftintr+0x4c
--- interrupt ---
c65a6020ca1a6024:
aches /var/crash> crash -N netbsd.gdb -M netbsd.11.core
Crash version 10.99.12, image version 10.99.12.
crash: _kvm_kvatop(0)
Kernel compiled without options LOCKDEBUG.
System panicked: cpu0: softints stuck for 301 seconds
Backtrace from time of crash is available.
crash> bt
end() at 0
kern_reboot() at kern_reboot+0x93
vpanic() at vpanic+0x17b
panic() at printf_nostamp
heartbeat() at heartbeat+0x34c
hardclock() at hardclock+0x8b
Xresume_lapic_ltimer() at Xresume_lapic_ltimer+0x1e
--- interrupt ---
bus_space_read_stream_2() at bus_space_read_stream_2+0xb
intr_wrapper() at intr_wrapper+0x4b
intr_biglock_wrapper() at intr_biglock_wrapper+0x1e
Xhandle_ioapic_edge22() at Xhandle_ioapic_edge22+0x75
--- interrupt ---
Xspllower() at Xspllower+0xe
softint_dispatch() at softint_dispatch+0x3b2
DDB lost frame for Xsoftintr+0x4c, trying 0xffff8d109410d0f0
Xsoftintr() at Xsoftintr+0x4c
--- interrupt ---
0:
crash> ps|grep '>'
991 > 991 7 7 8020000 ffff8822170ef000 minidlnad
2124 > 2124 7 8 8020100 ffff88166660bc00 screen
0 > 412 7 3 240 ffff881667077c00 ioflush
0 > 406 7 4 200 ffff881666850800 raidio2
0 > 404 7 2 200 ffff881666850000 raidio6
0 > 178 7 10 240 ffff881666600400 usb4
0 > 176 7 5 240 ffff881666594c00 usb2
0 > 175 7 9 240 ffff881666594800 usb1
0 > 174 7 11 240 ffff881666594400 usb0
0 > 118 1 6 201 ffff881664712c00 idle/6
0 > 27 7 1 200 ffff88166445d400 softser/1
0 > 26 7 1 200 ffff88166445d000 softclk/1
0 > 23 1 1 201 ffff88255214a400 idle/1
0 > 3 7 0 200 ffff88255a21b400 softnet/0
0 > 2 1 0 201 ffff88255a21b000 idle/0
crash> bt/a ffff88255a21b000
trace: pid 0 lid 2 at 0xffff8d1093cbafc0
acpicpu_cstate_idle() at acpicpu_cstate_idle+0x19a
idle_loop() at idle_loop+0x128
crash> bt/a ffff88255a21b400
trace: pid 0 lid 3 at 0xffff8d109410d0e0
softint_dispatch() at softint_dispatch+0x3b2
DDB lost frame for Xsoftintr+0x4c, trying 0xffff8d109410d0f0
Xsoftintr() at Xsoftintr+0x4c
--- interrupt ---
0:
crash> bt/a ffff88166445d400
trace: pid 0 lid 27 at 0xffff8d1093b6a000
_KERNEL_OPT_PMS_DISABLE_POWERHOOK() at ffff88255212e280
crash> bt/a ffff88166445d000
trace: pid 0 lid 26 at 0xffff8d10941fb0e0
softint_dispatch() at softint_dispatch+0x112
DDB lost frame for Xsoftintr+0x4c, trying 0xffff8d10941fb0f0
Xsoftintr() at Xsoftintr+0x4c
--- interrupt ---
0:
in both cases it seems that the softint thread on CPU0
(softser/0 in the first crash, softnet/0 in the second) was
fast-switched to from the idle/0 thread, but apparently
never came back. in both cases, the softint thread that was
fast-switched to has a tiny backtrace with softint_dispatch()
as the last call, so this *probably* means some softint
handler is doing something wrong (tm).
i have both core files and netbsd.gdb so can investigate
both via gdb and crash, as long as the system remains
functional and accessible remotely.
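
to make the "softint thread and idle thread both marked running on the same CPU" pattern easier to spot in other dumps, here is a minimal python sketch that groups the `ps|grep '>'` lines by CPU. it is not part of crash(8); the function names are hypothetical, and the column layout (pid, '>' marker, lid, state, cpu, flags, lwp address, name) is an assumption read off the output above.

```python
import re
from collections import defaultdict

# Assumed layout of one `ps|grep '>'` line from crash(8), inferred from the
# session above: PID, '>' marker, LID, state, CPU, flags, LWP address, name.
PS_LINE = re.compile(
    r"^\s*(?P<pid>\d+)\s*>\s*(?P<lid>\d+)\s+(?P<state>\d+)\s+(?P<cpu>\d+)"
    r"\s+(?P<flags>[0-9a-f]+)\s+(?P<lwp>[0-9a-f]+)\s+(?P<name>\S+)"
)


def group_by_cpu(ps_text):
    """Return {cpu: [(name, lwp_address), ...]} for every '>'-marked LWP."""
    cpus = defaultdict(list)
    for line in ps_text.splitlines():
        m = PS_LINE.match(line)
        if m:  # lines that do not fit the assumed layout are skipped
            cpus[int(m.group("cpu"))].append((m.group("name"), m.group("lwp")))
    return dict(cpus)


def crowded_cpus(ps_text):
    """Keep only CPUs that have more than one LWP marked with '>'."""
    return {cpu: lwps for cpu, lwps in group_by_cpu(ps_text).items()
            if len(lwps) > 1}


# A few lines copied from the first crash's ps output.
SAMPLE = """\
5934 > 5934 7 3 8020100 ffff85785c9ff400 tcsh
16353>10210 7 4 8020000 ffff856d79dfd400 dae
0 > 122 7 6 200 ffff856d3efa1c00 softser/6
0 > 121 7 6 200 ffff856d3efa1800 softclk/6
0 > 118 1 6 201 ffff856d3ef12c00 idle/6
0 > 6 7 0 200 ffff857c349dc000 softser/0
0 > 2 1 0 201 ffff857c34a1b000 idle/0
"""

if __name__ == "__main__":
    for cpu, lwps in sorted(crowded_cpus(SAMPLE).items()):
        print(cpu, lwps)
```

the lwp addresses it reports per CPU are the ones to feed to `bt/a` in crash(8), as done by hand in the session above.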
>How-To-Repeat:
>Fix: