Current-Users archive


Re: Call for testing: New kernel heartbeat(9) checks



On Fri, Jul 07, 2023 at 01:11:54PM +0000, Taylor R Campbell wrote:
> FYI: In 10.99.5, I just added a new kernel diagnostic subsystem called
> heartbeat(9) that will make the system crash rather than hang when
> CPUs are stuck in certain ways that hardware watchdog timers can't
> detect (or on systems without hardware watchdog timers).
> 
> It's optional for now, but it's small and I'd like to make it
> mandatory in the future.  If you'd like to try it out, add the
> following two lines to your kernel config:
> 
> options 	HEARTBEAT
> options 	HEARTBEAT_MAX_PERIOD_DEFAULT=15
> 
> You can disable it with `sysctl -w kern.heartbeat.max_period=0' at
> runtime, or use that knob to change the maximum period before the
> system will crash if not all (online) CPUs have made progress.
> 
> 
> Here are some manual tests that you can use to exercise it -- these
> are manual tests, not automatic tests, because some will deliberately
> crash the kernel to make sure the diagnostic works, and the others, if
> broken, will also crash the kernel.
> 
> Notes:
> - The magic numbers for debug.crashme.spl_spinout are for evbarm.
>   On x86, use IPL_SCHED=7, IPL_VM=6, and IPL_SOFTCLOCK=1.
>   For other architectures, consult the source for the numbers to use.
> - If you're on a single-CPU system, skip the cpuctl offline/online
>   tests and just do (4) and (5).
> - If you're on a >2-CPU system, then for the cpuctl offline/online
>   tests, try offlining all CPUs but one at a time.
> 
> 1.	cpuctl offline 0
> 	sleep 20
> 	cpuctl online 0

With this I get a panic on Xen:
[ 225.4605386] panic: kernel diagnostic assertion "kpreempt_disabled()" failed: file "/dsk/l1/misc/bouyer/HEAD/clean/src/sys/kern/kern_heartbeat.c", line 158
[ 225.4605386] cpu0: Begin traceback...
[ 225.4605386] vpanic() at netbsd:vpanic+0x163
[ 225.4605386] kern_assert() at netbsd:kern_assert+0x4b
[ 225.4705333] heartbeat_resume() at netbsd:heartbeat_resume+0x82
[ 225.4705333] cpu_xc_online() at netbsd:cpu_xc_online+0x11
[ 225.4705333] xc_thread() at netbsd:xc_thread+0xc8
[ 225.4705333] cpu0: End traceback...
[ 225.4705333] fatal breakpoint trap in supervisor mode
[ 225.4705333] trap type 1 code 0 rip 0xffffffff8022e96d cs 0xe030 rflags 0x202 cr2 0xffff9b8030d32000 ilevel 0 rsp 0xffff9b8030985dd0
[ 225.4705333] curlwp 0xffff9b80007c6900 pid 0.7 lowest kstack 0xffff9b80309812c0
Stopped in pid 0.7 (system) at  netbsd:breakpoint+0x5:  leave
breakpoint() at netbsd:breakpoint+0x5
vpanic() at netbsd:vpanic+0x163
kern_assert() at netbsd:kern_assert+0x4b
heartbeat_resume() at netbsd:heartbeat_resume+0x82
cpu_xc_online() at netbsd:cpu_xc_online+0x11
xc_thread() at netbsd:xc_thread+0xc8

Is this expected? Nothing here looks Xen-specific.


> 
> 2.	cpuctl offline 1
> 	sleep 20
> 	cpuctl online 1

Same panic.
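
Tests (1) and (2) are the same sequence applied to different CPU indices, so they can be written as a loop. This is only a minimal sketch: the function name cpu_cycle_cmds is made up for illustration, and it assumes nothing beyond the cpuctl and sleep commands quoted above. It prints the commands rather than running them, so the sequence can be reviewed first.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original test instructions):
# print, without executing, the offline/sleep/online sequence from
# tests (1) and (2) for each CPU index given as an argument.
cpu_cycle_cmds() {
	for cpu in "$@"; do
		echo "cpuctl offline $cpu"
		echo "sleep 20"
		echo "cpuctl online $cpu"
	done
}
```

To actually run the sequence for CPUs 0 and 1, the output could be piped to a root shell, e.g. `cpu_cycle_cmds 0 1 | sh -x`.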

> 
> 3.	cpuctl offline 0
> 	sysctl -w kern.heartbeat.max_period=5
> 	sleep 10
> 	sysctl -w kern.heartbeat.max_period=0
> 	sleep 10
> 	sysctl -w kern.heartbeat.max_period=15
> 	sleep 20
> 	cpuctl online 0

Here we have:
#        sysctl -w kern.heartbeat.max_period=15
[  53.5704682] panic: kernel diagnostic assertion "kpreempt_disabled()" failed: file "/dsk/l1/misc/bouyer/HEAD/clean/src/sys/kern/kern_heartbeat.c", line 158
[  53.5704682] cpu0: Begin traceback...
[  53.5704682] vpanic() at netbsd:vpanic+0x163
[  53.5704682] kern_assert() at netbsd:kern_assert+0x4b
[  53.5704682] heartbeat_resume() at netbsd:heartbeat_resume+0x82
[  53.5704682] xc_thread() at netbsd:xc_thread+0xc8
[  53.5704682] cpu0: End traceback...


> 
> 4.	sysctl -w debug.crashme_enable=1
> 	sysctl -w debug.crashme.spl_spinout=1   # IPL_SOFTCLOCK
> 	# verify system panics after 15sec

My sysctl command hung, but the system didn't panic.

> 
> 5.	sysctl -w debug.crashme_enable=1
> 	sysctl -w debug.crashme.spl_spinout=6   # IPL_SCHED
> 	# verify system panics after 15sec

This one did panic.
> 
> 6.	cpuctl offline 0
> 	sysctl -w debug.crashme_enable=1
> 	sysctl -w debug.crashme.spl_spinout=1   # IPL_SOFTCLOCK
> 	# verify system panics after 15sec

My sysctl command hung, but the system didn't panic.

> 
> 7.	cpuctl offline 0
> 	sysctl -w debug.crashme_enable=1
> 	sysctl -w debug.crashme.spl_spinout=5   # IPL_VM
> 	# verify system panics after 15sec

And this one did panic.
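
Since the debug.crashme.spl_spinout numbers differ per architecture (see the notes quoted at the top), a small lookup can help avoid mixing them up. This is a sketch only: the function name spl_value is hypothetical, and the table contains just the evbarm and x86 values given in this thread. For any other architecture it fails rather than guessing, since the numbers have to be taken from the source.

```shell
#!/bin/sh
# Hypothetical helper: map an architecture and IPL name to the value
# to pass to debug.crashme.spl_spinout. Only the numbers stated in
# this thread are included (evbarm from the test comments, x86 from
# the notes); anything else is reported as unknown.
spl_value() {
	arch=$1
	name=$2
	case "$arch:$name" in
	evbarm:IPL_SOFTCLOCK|x86:IPL_SOFTCLOCK) echo 1 ;;
	evbarm:IPL_VM)    echo 5 ;;
	evbarm:IPL_SCHED) echo 6 ;;
	x86:IPL_VM)       echo 6 ;;
	x86:IPL_SCHED)    echo 7 ;;
	*)
		echo "unknown arch/IPL: $arch $name (consult the source)" >&2
		return 1
		;;
	esac
}
```

For example, `sysctl -w debug.crashme.spl_spinout=$(spl_value x86 IPL_SCHED)` would pass 7 on x86.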

-- 
Manuel Bouyer <bouyer%antioche.eu.org@localhost>
     NetBSD: 26 years of experience will always make the difference
--

