Current-Users archive


Re: Call for testing: New kernel heartbeat(9) checks



On 7/7/23 22:11, Taylor R Campbell wrote:
> FYI: In 10.99.5, I just added a new kernel diagnostic subsystem called
> heartbeat(9) that will make the system crash rather than hang when
> CPUs are stuck in certain ways that hardware watchdog timers can't
> detect (or on systems without hardware watchdog timers)
> [...]

This is a NetBSD/amd64 guest with 2 virtual CPUs, running on VMware:

1.	cpuctl offline 0
	sleep 20
	cpuctl online 0

No panics.

2.	cpuctl offline 1
	sleep 20
	cpuctl online 1

No panics.

3.	cpuctl offline 0
	sysctl -w kern.heartbeat.max_period=5
	sleep 10
	sysctl -w kern.heartbeat.max_period=0
	sleep 10
	sysctl -w kern.heartbeat.max_period=15
	sleep 20
	cpuctl online 0

No panics.

4.	sysctl -w debug.crashme_enable=1
	sysctl -w debug.crashme.spl_spinout=2   # IPL_SOFTCLOCK
	# verify system panics after 15sec

Changing spl_spinout hangs sysctl. The kernel panics after 15 seconds:

Jul 8 22:16:13 netbsd-current /netbsd: [ 231.3581695] crashme_sysctl_forwarder:208: invoking "spl_spinout" (infinite loop at raised spl)
Jul 8 22:16:13 netbsd-current /netbsd: [ 231.3581695] crashme_spl_spinout: raising ipl to 2
Jul 8 22:16:13 netbsd-current /netbsd: [ 231.3581695] crashme_spl_spinout: raised ipl to 2, s=0
Jul 8 22:16:13 netbsd-current /netbsd: [ 247.0084882] cpu0: found cpu1 heart stopped beating after 16 seconds
Jul 8 22:16:13 netbsd-current /netbsd: [ 247.0084882] panic: cpu1[1743 sysctl]: heart stopped beating

5.	sysctl -w debug.crashme_enable=1
	sysctl -w debug.crashme.spl_spinout=6   # IPL_SCHED
	# verify system panics after 15sec

Same as 4, but it panics with a different message:

Jul 8 22:23:24 netbsd-current /netbsd: [ 411.0078445] panic: cpu0: softints stuck for 16 seconds

6.	cpuctl offline 0
	sysctl -w debug.crashme_enable=1
	sysctl -w debug.crashme.spl_spinout=2   # IPL_SOFTCLOCK
	# verify system panics after 15sec

It panics after 15 seconds:

Jul 8 22:27:04 netbsd-current /netbsd: [ 200.0060379] panic: cpu1: softints stuck for 16 seconds

7.	cpuctl offline 0
	sysctl -w debug.crashme_enable=1
	sysctl -w debug.crashme.spl_spinout=5   # IPL_VM
	# verify system panics after 15sec

It panics after 15 seconds:

Jul 8 22:29:45 netbsd-current /netbsd: [ 142.0029650] panic: cpu1: softints stuck for 16 seconds
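
For test 4, the kernel timestamps in the log give the actual delay between the spinout starting and the heartbeat check firing. A minimal sketch of that arithmetic follows; the `elapsed` helper is a hypothetical name of mine, not part of heartbeat(9) or any NetBSD tool, and the two values are copied from the bracketed timestamps in the test 4 log.

```shell
#!/bin/sh
# elapsed START END: print END - START in seconds, rounded to two decimals.
# Hypothetical helper for illustration only.
elapsed() {
    awk -v start="$1" -v end="$2" 'BEGIN { printf "%.2f\n", end - start }'
}

# Timestamps from test 4: spl raised at 231.3581695, panic at 247.0084882.
elapsed 231.3581695 247.0084882
```

This prints 15.65, which fits both the configured 15-second max_period and the "16 seconds" reported in the panic message.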
