Source-Changes-HG archive

[src/trunk]: src/sys/sys heartbeat(9): New mechanism to check progress of ker...



details:   https://anonhg.NetBSD.org/src/rev/daef80317129
branches:  trunk
changeset: 377320:daef80317129
user:      riastradh <riastradh%NetBSD.org@localhost>
date:      Fri Jul 07 12:34:49 2023 +0000

description:
heartbeat(9): New mechanism to check progress of kernel.

This uses hard interrupts to check progress of low-priority soft
interrupts, and one CPU to check progress of another CPU.

If no progress has been made after a configurable number of seconds
(kern.heartbeat.max_period, default 15), then the system panics --
preferably on the CPU that is stuck so we get a stack trace in dmesg
of where it was stuck, but if the stuckness was detected by another
CPU and the stuck CPU doesn't acknowledge the request to panic within
one second, the detecting CPU panics instead.
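
As a rough illustration of the rule above, here is a user-space sketch in
C.  It is not the code in sys/kern/kern_heartbeat.c: the struct, field,
and function names (hb_cpu, hb_softint, hb_tick_stuck, hb_check_other)
are hypothetical, and the IPI acknowledgement is reduced to a boolean
argument.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-CPU record; the real state lives in struct cpu_info. */
struct hb_cpu {
	unsigned count;		/* ticks counted by the hard clock interrupt */
	unsigned uptime_cache;	/* uptime (s) last recorded by the soft interrupt */
	unsigned uptime_stamp;	/* value of count when uptime_cache was updated */
};

enum hb_verdict { HB_OK, HB_PANIC_TARGET, HB_PANIC_SELF };

/* Soft interrupt side: record that this CPU has seen uptime advance. */
static void
hb_softint(struct hb_cpu *c, unsigned uptime)
{
	c->uptime_cache = uptime;
	c->uptime_stamp = c->count;
}

/*
 * Hard interrupt side: count one tick and report whether the soft
 * interrupt has failed to run for more than max_period seconds worth
 * of ticks.  A max_period of zero disables the check.
 */
static bool
hb_tick_stuck(struct hb_cpu *c, unsigned hz, unsigned max_period)
{
	assert(hz > 0);
	c->count++;
	if (max_period == 0)
		return false;
	return c->count - c->uptime_stamp > hz * max_period;
}

/*
 * Cross-CPU check: if the other CPU's cached uptime is more than
 * max_period seconds behind, ask it to panic; if it does not
 * acknowledge within one second, the detecting CPU panics instead.
 */
static enum hb_verdict
hb_check_other(const struct hb_cpu *other, unsigned now,
    unsigned max_period, bool acked_within_1s)
{
	if (max_period == 0 || now < other->uptime_cache ||
	    now - other->uptime_cache <= max_period)
		return HB_OK;
	return acked_within_1s ? HB_PANIC_TARGET : HB_PANIC_SELF;
}
```

With hz=100 and max_period=15, hb_tick_stuck() first reports a stuck CPU
on the 1501st tick after the last hb_softint() call.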

This doesn't supplant hardware watchdog timers.  It is possible for
hard interrupts to be stuck on all CPUs for some reason too; in that
case heartbeat(9) never gets a chance to run, so it cannot detect the
hang.

Downside: heartbeat(9) relies on hardclock to run at a reasonably
consistent rate, which might cause trouble for the glorious tickless
future.  However, it could be adapted to take a parameter for an
approximate number of units that have elapsed since the last call on
the current CPU, rather than treating that as a constant 1.
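
A minimal sketch of that adaptation, assuming a caller that can estimate
how many tick-equivalents have passed since its previous call.  The
names hb_progress, hb_progress_made, and hb_advance are hypothetical and
do not appear in this commit; a periodic hardclock would simply pass 1.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical state: time units since progress was last observed. */
struct hb_progress {
	unsigned long since_progress;
};

/* Called when the soft interrupt reports progress on this CPU. */
static void
hb_progress_made(struct hb_progress *p)
{
	assert(p != NULL);
	p->since_progress = 0;
}

/*
 * Account for `elapsed' units since the last call on this CPU and
 * report whether more than `limit' units have passed without
 * progress.  A limit of zero disables the check.
 */
static bool
hb_advance(struct hb_progress *p, unsigned long elapsed,
    unsigned long limit)
{
	assert(p != NULL);
	p->since_progress += elapsed;
	return limit != 0 && p->since_progress > limit;
}
```

The only difference from a fixed-rate version is that the counter grows
by the caller's estimate rather than by one, so the threshold
comparison is unchanged.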

XXX kernel revbump -- changes struct cpu_info layout

diffstat:

 share/man/man9/heartbeat.9 |  169 +++++++++++
 sys/kern/files.kern        |    5 +-
 sys/kern/init_main.c       |   14 +-
 sys/kern/kern_clock.c      |   13 +-
 sys/kern/kern_cpu.c        |   14 +-
 sys/kern/kern_heartbeat.c  |  656 +++++++++++++++++++++++++++++++++++++++++++++
 sys/sys/cpu_data.h         |   11 +-
 sys/sys/heartbeat.h        |   53 +++
 8 files changed, 927 insertions(+), 8 deletions(-)

diffs (truncated from 1092 to 300 lines):

diff -r 496973a99d8c -r daef80317129 share/man/man9/heartbeat.9
--- /dev/null   Thu Jan 01 00:00:00 1970 +0000
+++ b/share/man/man9/heartbeat.9        Fri Jul 07 12:34:49 2023 +0000
@@ -0,0 +1,169 @@
+.\"    $NetBSD: heartbeat.9,v 1.1 2023/07/07 12:34:49 riastradh Exp $
+.\"
+.\" Copyright (c) 2023 The NetBSD Foundation, Inc.
+.\" All rights reserved.
+.\"
+.\" Redistribution and use in source and binary forms, with or without
+.\" modification, are permitted provided that the following conditions
+.\" are met:
+.\" 1. Redistributions of source code must retain the above copyright
+.\"    notice, this list of conditions and the following disclaimer.
+.\" 2. Redistributions in binary form must reproduce the above copyright
+.\"    notice, this list of conditions and the following disclaimer in the
+.\"    documentation and/or other materials provided with the distribution.
+.\"
+.\" THIS SOFTWARE IS PROVIDED BY THE NETBSD FOUNDATION, INC. AND CONTRIBUTORS
+.\" ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
+.\" TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+.\" PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE FOUNDATION OR CONTRIBUTORS
+.\" BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
+.\" CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
+.\" SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
+.\" INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
+.\" CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
+.\" ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
+.\" POSSIBILITY OF SUCH DAMAGE.
+.\"
+.Dd July 6, 2023
+.Dt HEARTBEAT 9
+.Os
+.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
+.Sh NAME
+.Nm heartbeat
+.Nd periodic checks to ensure CPUs are making progress
+.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
+.Sh SYNOPSIS
+.Fd "options   HEARTBEAT"
+.Fd "options   HEARTBEAT_MAX_PERIOD_DEFAULT=15"
+.\"
+.In sys/heartbeat.h
+.\"
+.Ft void
+.Fn heartbeat_start void
+.Ft void
+.Fn heartbeat void
+.Ft void
+.Fn heartbeat_suspend void
+.Ft void
+.Fn heartbeat_resume void
+.Fd "#ifdef DDB"
+.Ft void
+.Fn heartbeat_dump void
+.Fd "#endif"
+.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
+.Sh DESCRIPTION
+The
+.Nm
+subsystem verifies that soft interrupts
+.Pq Xr softint 9
+and the system
+.Xr timecounter 9
+are making progress over time, and panics if they appear stuck.
+.Pp
+The number of seconds before
+.Nm
+panics without progress is controlled by the sysctl knob
+.Li kern.heartbeat.max_period ,
+which defaults to 15.
+If set to zero, heartbeat checks are disabled.
+.Pp
+The periodic hardware timer interrupt handler calls
+.Fn heartbeat
+every tick on each CPU.
+Once per second
+.Po
+i.e., every
+.Xr hz 9
+ticks
+.Pc ,
+.Fn heartbeat
+schedules a soft interrupt at priority
+.Dv SOFTINT_CLOCK
+to advance the current CPU's view of
+.Xr time_uptime 9 .
+.Pp
+.Fn heartbeat
+checks whether
+.Xr time_uptime 9
+has changed, to see if either the
+.Xr timecounter 9
+or soft interrupts on the current CPU are stuck.
+If it hasn't advanced within
+.Li kern.heartbeat.max_period
+seconds worth of ticks, or if it has updated and the current CPU's view
+of it hasn't been updated for more than
+.Li kern.heartbeat.max_period
+seconds, then
+.Fn heartbeat
+panics.
+.Pp
+.Fn heartbeat
+also checks whether the next online CPU has advanced its view of
+.Xr time_uptime 9 ,
+to see if soft interrupts
+.Pq including Xr callout 9
+on that CPU are stuck.
+If it hasn't updated within
+.Li kern.heartbeat.max_period
+seconds,
+.Fn heartbeat
+sends an
+.Xr ipi 9
+to panic on that CPU.
+If that CPU has not acknowledged the
+.Xr ipi 9
+within one second,
+.Fn heartbeat
+panics on the current CPU instead.
+.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
+.Sh FUNCTIONS
+.Bl -tag -width Fn
+.It Fn heartbeat
+Check for timecounter and soft interrupt progress on this CPU and on
+another CPU, and schedule a soft interrupt to advance this CPU's view
+of timecounter progress.
+.Pp
+Called by
+.Xr hardclock 9
+periodically.
+.It Fn heartbeat_dump
+Print the heartbeat counter, uptime cache, and uptime cache
+timestamp (in units of heartbeats) to the console.
+.Pp
+Can be invoked from
+.Xr ddb 9
+by
+.Ql call heartbeat_dump .
+.It Fn heartbeat_resume
+Resume heartbeat monitoring of the current CPU.
+.Pp
+Called after a CPU has started running but before it has been
+marked online.
+.It Fn heartbeat_start
+Start monitoring heartbeats systemwide.
+.Pp
+Called by
+.Xr main 9
+as soon as soft interrupts can be established.
+.It Fn heartbeat_suspend
+Suspend heartbeat monitoring of the current CPU.
+.Pp
+Called after the current CPU has been marked offline but before it has
+stopped running.
+.El
+.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
+.Sh CODE REFERENCES
+The
+.Nm
+subsystem is implemented in
+.Pa sys/kern/kern_heartbeat.c .
+.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
+.Sh SEE ALSO
+.Xr wdogctl 8 ,
+.Xr swwdog 4
+.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
+.Sh HISTORY
+The
+.Nm
+subsystem first appeared in
+.Nx 11.0 .
diff -r 496973a99d8c -r daef80317129 sys/kern/files.kern
--- a/sys/kern/files.kern       Fri Jul 07 12:34:26 2023 +0000
+++ b/sys/kern/files.kern       Fri Jul 07 12:34:49 2023 +0000
@@ -1,13 +1,15 @@
-#      $NetBSD: files.kern,v 1.57 2021/09/19 15:51:27 thorpej Exp $
+#      $NetBSD: files.kern,v 1.58 2023/07/07 12:34:50 riastradh Exp $
 
 #
 # kernel sources
 #
 define kern:   cprng_fast, machdep, uvm
+defflag        opt_heartbeat.h                 HEARTBEAT
 defflag        opt_kern.h                      KERN
 defflag        opt_script.h                    SETUIDSCRIPTS FDSCRIPTS
 defflag                                        KASLR
 defparam opt_cnmagic.h                 CNMAGIC
+defparam heartbeat.h                   HEARTBEAT_MAX_PERIOD_DEFAULT
 
 file   conf/debugsyms.c                kern
 file   conf/param.c                    kern
@@ -48,6 +50,7 @@ file  kern/kern_exec.c                kern
 file   kern/kern_exit.c                kern
 file   kern/kern_fork.c                kern
 file   kern/kern_idle.c                kern
+file   kern/kern_heartbeat.c           kern & heartbeat
 file   kern/kern_hook.c                kern
 file   kern/kern_kthread.c             kern
 file   kern/kern_ktrace.c              ktrace
diff -r 496973a99d8c -r daef80317129 sys/kern/init_main.c
--- a/sys/kern/init_main.c      Fri Jul 07 12:34:26 2023 +0000
+++ b/sys/kern/init_main.c      Fri Jul 07 12:34:49 2023 +0000
@@ -1,4 +1,4 @@
-/*     $NetBSD: init_main.c,v 1.541 2022/10/26 23:20:47 riastradh Exp $        */
+/*     $NetBSD: init_main.c,v 1.542 2023/07/07 12:34:50 riastradh Exp $        */
 
 /*-
  * Copyright (c) 2008, 2009, 2019 The NetBSD Foundation, Inc.
@@ -97,10 +97,11 @@
  */
 
 #include <sys/cdefs.h>
-__KERNEL_RCSID(0, "$NetBSD: init_main.c,v 1.541 2022/10/26 23:20:47 riastradh Exp $");
+__KERNEL_RCSID(0, "$NetBSD: init_main.c,v 1.542 2023/07/07 12:34:50 riastradh Exp $");
 
 #include "opt_cnmagic.h"
 #include "opt_ddb.h"
+#include "opt_heartbeat.h"
 #include "opt_inet.h"
 #include "opt_ipsec.h"
 #include "opt_modular.h"
@@ -199,6 +200,7 @@ extern void *_binary_splash_image_end;
 #include <sys/cprng.h>
 #include <sys/psref.h>
 #include <sys/radixtree.h>
+#include <sys/heartbeat.h>
 
 #include <sys/syscall.h>
 #include <sys/syscallargs.h>
@@ -557,6 +559,14 @@ main(void)
        /* Once all CPUs are detected, initialize the per-CPU cprng_fast.  */
        cprng_fast_init();
 
+#ifdef HEARTBEAT
+       /*
+        * Now that softints can be established, start monitoring
+        * system heartbeat on all CPUs.
+        */
+       heartbeat_start();
+#endif
+
        ssp_init();
 
        ubc_init();             /* must be after autoconfig */
diff -r 496973a99d8c -r daef80317129 sys/kern/kern_clock.c
--- a/sys/kern/kern_clock.c     Fri Jul 07 12:34:26 2023 +0000
+++ b/sys/kern/kern_clock.c     Fri Jul 07 12:34:49 2023 +0000
@@ -1,4 +1,4 @@
-/*     $NetBSD: kern_clock.c,v 1.149 2023/06/30 21:42:05 riastradh Exp $       */
+/*     $NetBSD: kern_clock.c,v 1.150 2023/07/07 12:34:50 riastradh Exp $       */
 
 /*-
  * Copyright (c) 2000, 2004, 2006, 2007, 2008 The NetBSD Foundation, Inc.
@@ -69,11 +69,12 @@
  */
 
 #include <sys/cdefs.h>
-__KERNEL_RCSID(0, "$NetBSD: kern_clock.c,v 1.149 2023/06/30 21:42:05 riastradh Exp $");
+__KERNEL_RCSID(0, "$NetBSD: kern_clock.c,v 1.150 2023/07/07 12:34:50 riastradh Exp $");
 
 #ifdef _KERNEL_OPT
 #include "opt_dtrace.h"
 #include "opt_gprof.h"
+#include "opt_heartbeat.h"
 #include "opt_multiprocessor.h"
 #endif
 
@@ -92,6 +93,7 @@
 #include <sys/cpu.h>
 #include <sys/atomic.h>
 #include <sys/rndsource.h>
+#include <sys/heartbeat.h>
 
 #ifdef GPROF
 #include <sys/gmon.h>
@@ -335,6 +337,13 @@ hardclock(struct clockframe *frame)
                tc_ticktock();
        }
 
+#ifdef HEARTBEAT
+       /*
+        * Make sure the CPUs and timecounter are making progress.
+        */
+       heartbeat();
+#endif
+
        /*
         * Update real-time timeout queue.
         */
diff -r 496973a99d8c -r daef80317129 sys/kern/kern_cpu.c
--- a/sys/kern/kern_cpu.c       Fri Jul 07 12:34:26 2023 +0000
+++ b/sys/kern/kern_cpu.c       Fri Jul 07 12:34:49 2023 +0000
@@ -1,4 +1,4 @@
-/*     $NetBSD: kern_cpu.c,v 1.94 2023/02/26 07:13:55 skrll Exp $      */
+/*     $NetBSD: kern_cpu.c,v 1.95 2023/07/07 12:34:50 riastradh Exp $  */
 
 /*-
  * Copyright (c) 2007, 2008, 2009, 2010, 2012, 2019 The NetBSD Foundation, Inc.
