kern/59870: kernel lock runtime diagnostics are difficult

To: kern-bug-people%netbsd.org@localhost,gnats-admin%netbsd.org@localhost,netbsd-bugs%netbsd.org@localhost
Subject: kern/59870: kernel lock runtime diagnostics are difficult
From: campbell+netbsd%mumble.net@localhost
Date: Wed, 31 Dec 2025 04:20:01 +0000 (UTC)

>Number:         59870
>Category:       kern
>Synopsis:       kernel lock runtime diagnostics are difficult
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Wed Dec 31 04:20:00 +0000 2025
>Originator:     Taylor R Campbell
>Release:        current, 11, 10, 9, ...
>Organization:
The NetBSD Locker, Inc.
>Environment:
>Description:

	Sometimes the legacy kernel lock is held for an unreasonably
	long time.

	How do you tell who holds it when this happens?  If you're
	lucky, you can enter ddb and find the threads running on each
	cpuN with `ps' and switch to `mach cpu N' and run `bt' to find
	a code path that obviously holds the kernel lock.

	If you're not lucky, you get a heartbeat panic because some
	softint tried to take the kernel lock and waited too long for
	it, and then the crash dump fails because suspendsched has
	KASSERT(!cpu_intr_p()) and the heartbeat panic happens within
	an interrupt handler, as in:

	https://mail-index.NetBSD.org/current-users/2025/12/27/msg047183.html

	If you have enabled LOCKDEBUG, and the kernel lock is held for
	more than 10sec, you get a kernel lock spinout and an IPI is
	sent to the hogging CPU to get a stack trace.  But since
	autoconf(9) runs kernel-locked, loading a module for a driver
	can trigger this panic.
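
	For illustration only, the shape of such a spinout check can
	be sketched in userland C.  Everything named toy_* is
	hypothetical and is not the kernel's code: the spin budget
	stands in for the 10sec timeout, and the hook stands in for
	the IPI that asks the hogging CPU for a stack trace.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Toy lock state; the real kernel lock is not structured like this. */
struct toy_lock {
	bool held;
	int owner_cpu;		/* CPU to blame if we spin out */
};

/* Hypothetical hook standing in for "send an IPI to dump a stack trace". */
typedef void (*toy_spinout_hook)(int owner_cpu);

static int toy_last_blamed_cpu = -1;

static void
toy_request_trace(int owner_cpu)
{
	toy_last_blamed_cpu = owner_cpu;
	printf("spinout: requesting stack trace from cpu%d\n", owner_cpu);
}

/*
 * Try to take the lock, giving up after `budget' attempts.  The budget
 * stands in for the 10sec timeout; a kernel would measure elapsed time
 * and use atomic operations instead of plain loads and stores.
 * Returns true if the lock was acquired.
 */
static bool
toy_lock_or_spinout(struct toy_lock *l, int self, unsigned long budget,
    toy_spinout_hook hook)
{
	assert(hook != NULL);
	for (unsigned long i = 0; i < budget; i++) {
		if (!l->held) {
			l->held = true;
			l->owner_cpu = self;
			return true;
		}
	}
	hook(l->owner_cpu);	/* ask the hog where it is stuck */
	return false;
}
```

	Calling toy_lock_or_spinout(&l, 0, 1000, toy_request_trace) on
	a lock held by cpu1 exhausts the budget and invokes the hook
	for cpu1 rather than spinning forever.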

	It's also annoying when the kernel lock is held for enough time
	to make the system flaky (partly because, e.g., wscons(4) and
	pckbport(4) run with it held, and so do some network drivers
	like iwm(4)), but not enough to trigger other diagnostics.  However,
	attempts to dtrace the kernel_lock function, along the lines of
	https://mail-index.netbsd.org/tech-kern/2022/10/30/msg028499.html,
	only show that it was taken in sleepq_block because something
	that held the kernel lock slept and then woke up again.

>How-To-Repeat:

	- chase bad interactive system latency due to kernel lock hogs
	- try to diagnose panics like
	  https://mail-index.netbsd.org/current-users/2025/12/27/msg047183.html

>Fix:

	1. Enable the logic to provoke an IPI to dump a stack trace
	   _without_ LOCKDEBUG.

	2. Pass a cookie across the unlock/sleep/relock logic so that
	   dtrace can tell on whose behalf the relock happened.
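
	A minimal userland sketch of idea 2, assuming a cookie that
	records the original acquisition site along with the saved
	lock depth.  All toy_* names and the "driver_ioctl" site are
	hypothetical; this is not the existing kernel interface, and
	toy_probe_acquire merely stands in for a dtrace probe.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical acquisition site, used only for this illustration. */
static const char toy_site[] = "driver_ioctl";

/* Hypothetical cookie carried across unlock/sleep/relock. */
struct toy_cookie {
	const char *site;	/* code path that originally took the lock */
	int nlocks;		/* recursion depth saved across the sleep */
};

/* Toy stand-in for the kernel lock; not the real lock state. */
struct toy_klock {
	int depth;		/* recursion depth, 0 if unheld */
	const char *site;	/* site charged for the current hold */
};

static const char *toy_last_site;	/* last site seen by the "probe" */

/* Stand-in for a dtrace probe firing on acquisition. */
static void
toy_probe_acquire(const char *site)
{
	toy_last_site = site;
	printf("kernel lock acquired on behalf of %s\n", site);
}

static void
toy_lock(struct toy_klock *kl, const char *site)
{
	if (kl->depth++ == 0) {
		kl->site = site;
		toy_probe_acquire(site);
	}
}

/* Drop every hold before sleeping, remembering who held them. */
static struct toy_cookie
toy_unlock_all(struct toy_klock *kl)
{
	struct toy_cookie c = { kl->site, kl->depth };

	kl->depth = 0;
	kl->site = NULL;
	return c;
}

/* Relock after wakeup, charging the original site, not the sleep path. */
static void
toy_relock(struct toy_klock *kl, struct toy_cookie c)
{
	assert(kl->depth == 0);
	if (c.nlocks == 0)
		return;
	kl->depth = c.nlocks;
	kl->site = c.site;
	toy_probe_acquire(c.site);
}

/* Returns the site the probe reported for the relock after a sleep. */
static const char *
toy_sleep_demo(void)
{
	struct toy_klock kl = { 0, NULL };
	struct toy_cookie c;

	toy_lock(&kl, toy_site);
	c = toy_unlock_all(&kl);
	toy_last_site = NULL;	/* the thread would sleep here */
	toy_relock(&kl, c);
	return toy_last_site;
}
```

	With the cookie, the probe on relock reports the original
	site instead of the sleep path, which is the information
	missing from the sleepq_block traces described above.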

	Other ideas welcome!
