Current-Users archive
Re: Call for testing: New kernel heartbeat(9) checks
On Fri, Jul 07, 2023 at 01:11:54PM +0000, Taylor R Campbell wrote:
> FYI: In 10.99.5, I just added a new kernel diagnostic subsystem called
> heartbeat(9) that will make the system crash rather than hang when
> CPUs are stuck in certain ways that hardware watchdog timers can't
> detect (or on systems without hardware watchdog timers).
>
> It's optional for now, but it's small and I'd like to make it
> mandatory in the future. If you'd like to try it out, add the
> following two lines to your kernel config:
>
> options HEARTBEAT
> options HEARTBEAT_MAX_PERIOD_DEFAULT=15
>
> You can disable it with `sysctl -w kern.heartbeat.max_period=0' at
> runtime, or use that knob to change the maximum period before the
> system will crash if not all (online) CPUs have made progress.
>
>
> Here are some manual tests that you can use to exercise it -- these
> are manual tests, not automatic tests, because some will deliberately
> crash the kernel to make sure the diagnostic works, and the others, if
> broken, will also crash the kernel.
>
> Notes:
> - The magic numbers for debug.crashme.spl_spinout are for evbarm.
> On x86, use IPL_SCHED=7, IPL_VM=6, and IPL_SOFTCLOCK=1.
> For other architectures, consult the source for the numbers to use.
> - If you're on a single-CPU system, skip the cpuctl offline/online
> tests and just do (4) and (5).
> - If you're on a >2-CPU system, then for the cpuctl offline/online
> tests, try offlining all CPUs but one at a time.
>
> 1. cpuctl offline 0
> sleep 20
> cpuctl online 0
With this I get a panic on Xen:
[ 225.4605386] panic: kernel diagnostic assertion "kpreempt_disabled()" failed: file "/dsk/l1/misc/bouyer/HEAD/clean/src/sys/kern/kern_heartbeat.c", line 158
[ 225.4605386] cpu0: Begin traceback...
[ 225.4605386] vpanic() at netbsd:vpanic+0x163
[ 225.4605386] kern_assert() at netbsd:kern_assert+0x4b
[ 225.4705333] heartbeat_resume() at netbsd:heartbeat_resume+0x82
[ 225.4705333] cpu_xc_online() at netbsd:cpu_xc_online+0x11
[ 225.4705333] xc_thread() at netbsd:xc_thread+0xc8
[ 225.4705333] cpu0: End traceback...
[ 225.4705333] fatal breakpoint trap in supervisor mode
[ 225.4705333] trap type 1 code 0 rip 0xffffffff8022e96d cs 0xe030 rflags 0x202 cr2 0xffff9b8030d32000 ilevel 0 rsp 0xffff9b8030985dd0
[ 225.4705333] curlwp 0xffff9b80007c6900 pid 0.7 lowest kstack 0xffff9b80309812c0
Stopped in pid 0.7 (system) at netbsd:breakpoint+0x5: leave
breakpoint() at netbsd:breakpoint+0x5
vpanic() at netbsd:vpanic+0x163
kern_assert() at netbsd:kern_assert+0x4b
heartbeat_resume() at netbsd:heartbeat_resume+0x82
cpu_xc_online() at netbsd:cpu_xc_online+0x11
xc_thread() at netbsd:xc_thread+0xc8
Is this expected? Nothing here looks Xen-specific.
>
> 2. cpuctl offline 1
> sleep 20
> cpuctl online 1
Same panic here.
>
> 3. cpuctl offline 0
> sysctl -w kern.heartbeat.max_period=5
> sleep 10
> sysctl -w kern.heartbeat.max_period=0
> sleep 10
> sysctl -w kern.heartbeat.max_period=15
> sleep 20
> cpuctl online 0
Here we have:
# sysctl -w kern.heartbeat.max_period=15
[ 53.5704682] panic: kernel diagnostic assertion "kpreempt_disabled()" failed: file "/dsk/l1/misc/bouyer/HEAD/clean/src/sys/kern/kern_heartbeat.c", line 158
[ 53.5704682] cpu0: Begin traceback...
[ 53.5704682] vpanic() at netbsd:vpanic+0x163
[ 53.5704682] kern_assert() at netbsd:kern_assert+0x4b
[ 53.5704682] heartbeat_resume() at netbsd:heartbeat_resume+0x82
[ 53.5704682] xc_thread() at netbsd:xc_thread+0xc8
[ 53.5704682] cpu0: End traceback...
>
> 4. sysctl -w debug.crashme_enable=1
> sysctl -w debug.crashme.spl_spinout=1 # IPL_SOFTCLOCK
> # verify system panics after 15sec
My sysctl command hung, but the system didn't panic.
>
> 5. sysctl -w debug.crashme_enable=1
> sysctl -w debug.crashme.spl_spinout=6 # IPL_SCHED
> # verify system panics after 15sec
This one did panic.
>
> 6. cpuctl offline 0
> sysctl -w debug.crashme_enable=1
> sysctl -w debug.crashme.spl_spinout=1 # IPL_SOFTCLOCK
> # verify system panics after 15sec
My sysctl command hung, but the system didn't panic.
>
> 7. cpuctl offline 0
> sysctl -w debug.crashme_enable=1
> sysctl -w debug.crashme.spl_spinout=5 # IPL_VM
> # verify system panics after 15sec
And this one did panic.
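
For anyone rerunning the spl_spinout tests on another architecture, here
is a small dry-run sketch that maps the IPL names to the numbers quoted
in this thread and prints the sysctl commands instead of running them.
The helper names ipl_number and spinout_commands are made up for
illustration, and the table only covers the x86 and evbarm values
mentioned above; for anything else the kernel source remains the
authority.

```shell
#!/bin/sh
# Dry-run helper: prints the crashme commands for a given architecture
# and IPL name; it does not execute them.
# Function names are hypothetical. IPL numbers come from this thread only:
# x86 from the notes, evbarm from the comments on tests 4, 5 and 7.

ipl_number() {
	# $1 = architecture (x86 or evbarm), $2 = IPL name
	case "$1:$2" in
	x86:IPL_SOFTCLOCK)	echo 1 ;;
	x86:IPL_VM)		echo 6 ;;
	x86:IPL_SCHED)		echo 7 ;;
	evbarm:IPL_SOFTCLOCK)	echo 1 ;;
	evbarm:IPL_VM)		echo 5 ;;
	evbarm:IPL_SCHED)	echo 6 ;;
	*)	echo "unknown arch/IPL: $1 $2" >&2; return 1 ;;
	esac
}

spinout_commands() {
	# Print the two sysctl invocations used by the spl_spinout tests.
	n=$(ipl_number "$1" "$2") || return 1
	echo "sysctl -w debug.crashme_enable=1"
	echo "sysctl -w debug.crashme.spl_spinout=$n # $2"
}

spinout_commands x86 IPL_SCHED
```

Printing the commands first makes it easy to check the number against
the intended IPL before deliberately hanging a CPU.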
--
Manuel Bouyer <bouyer%antioche.eu.org@localhost>
NetBSD: 26 years of experience will always make the difference