Current-Users archive

Call for testing: New kernel heartbeat(9) checks



FYI: In 10.99.5, I just added a new kernel diagnostic subsystem called
heartbeat(9) that will make the system crash rather than hang when
CPUs are stuck in certain ways that hardware watchdog timers can't
detect (or on systems without hardware watchdog timers).

It's optional for now, but it's small and I'd like to make it
mandatory in the future.  If you'd like to try it out, add the
following two lines to your kernel config:

options 	HEARTBEAT
options 	HEARTBEAT_MAX_PERIOD_DEFAULT=15

You can disable it with `sysctl -w kern.heartbeat.max_period=0' at
runtime, or use that knob to change the maximum period before the
system will crash if not all (online) CPUs have made progress.
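
For scripted runs it can be handy to change the period only for the
duration of one command and then put the old value back.  A minimal
sketch, assuming POSIX sh; the function name with_heartbeat_period is
made up for illustration, and only the kern.heartbeat.max_period knob
comes from the description above:

```sh
# Hypothetical helper, not part of NetBSD: run a command with
# kern.heartbeat.max_period temporarily set to $1 seconds (0 disables
# the check), then restore the previous value.
with_heartbeat_period() {
	period=$1; shift
	# Remember the current setting so it can be restored afterwards.
	old=$(sysctl -n kern.heartbeat.max_period) || return 1
	sysctl -w kern.heartbeat.max_period="$period" >/dev/null || return 1
	"$@"
	rc=$?
	sysctl -w kern.heartbeat.max_period="$old" >/dev/null
	# Report the exit status of the wrapped command, not of sysctl.
	return "$rc"
}
```

For example, `with_heartbeat_period 0 sleep 30' would disable the
check for thirty seconds and then restore whatever period was set
before.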


Here are some manual tests that you can use to exercise it.  They are
manual rather than automated because some deliberately crash the
kernel to make sure the diagnostic works, and the others, if broken,
will also crash the kernel.

Notes:
- The magic numbers for debug.crashme.spl_spinout in the tests below
  are for evbarm (IPL_SOFTCLOCK=1, IPL_VM=5, IPL_SCHED=6).
  On x86, use IPL_SCHED=7, IPL_VM=6, and IPL_SOFTCLOCK=2.
  For other architectures, consult the source for the numbers to use.
- If you're on a single-CPU system, skip the cpuctl offline/online
  tests and just do (4) and (5).
- If you're on a >2-CPU system, then for the cpuctl offline/online
  tests, try offlining all CPUs but one at a time.

1.	cpuctl offline 0
	sleep 20
	cpuctl online 0

2.	cpuctl offline 1
	sleep 20
	cpuctl online 1

3.	cpuctl offline 0
	sysctl -w kern.heartbeat.max_period=5
	sleep 10
	sysctl -w kern.heartbeat.max_period=0
	sleep 10
	sysctl -w kern.heartbeat.max_period=15
	sleep 20
	cpuctl online 0

4.	sysctl -w debug.crashme_enable=1
	sysctl -w debug.crashme.spl_spinout=1   # IPL_SOFTCLOCK
	# verify system panics after 15sec

5.	sysctl -w debug.crashme_enable=1
	sysctl -w debug.crashme.spl_spinout=6   # IPL_SCHED
	# verify system panics after 15sec

6.	cpuctl offline 0
	sysctl -w debug.crashme_enable=1
	sysctl -w debug.crashme.spl_spinout=1   # IPL_SOFTCLOCK
	# verify system panics after 15sec

7.	cpuctl offline 0
	sysctl -w debug.crashme_enable=1
	sysctl -w debug.crashme.spl_spinout=5   # IPL_VM
	# verify system panics after 15sec
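
Tests 1 and 2 above differ only in the CPU number, so on machines with
more CPUs they can be driven by a small loop.  A rough sketch, assuming
POSIX sh; run, cycle_cpu, and the DRY_RUN variable are names invented
here, and the list of CPU numbers has to be supplied by hand:

```sh
# Hypothetical helper, not part of NetBSD: print each command when
# DRY_RUN=1, otherwise execute it.
run() {
	if [ "${DRY_RUN:-0}" = 1 ]; then
		echo "$*"
	else
		"$@"
	fi
}

# Offline one CPU ($1), wait $2 seconds (default 20, longer than the
# 15-second heartbeat period), and bring the CPU back online.  This is
# the same sequence as tests 1 and 2.
cycle_cpu() {
	run cpuctl offline "$1" &&
	run sleep "${2:-20}" &&
	run cpuctl online "$1"
}

# Example: cycle CPUs 0 and 1 in turn; adjust the list to your machine.
# for cpu in 0 1; do cycle_cpu "$cpu" || break; done
```

Setting DRY_RUN=1 beforehand only prints the commands, which is a
cheap way to check the sequence before running it for real as root.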

