NetBSD-Bugs archive


kern/59870: kernel lock runtime diagnostics are difficult



>Number:         59870
>Category:       kern
>Synopsis:       kernel lock runtime diagnostics are difficult
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Wed Dec 31 04:20:00 +0000 2025
>Originator:     Taylor R Campbell
>Release:        current, 11, 10, 9, ...
>Organization:
The NetBSD Locker, Inc.
>Environment:
>Description:

	Sometimes the legacy kernel lock is held for an unreasonably
	long time.

	How do you tell who holds it when this happens?  If you're
	lucky, you can enter ddb and find the threads running on each
	cpuN with `ps' and switch to `mach cpu N' and run `bt' to find
	a code path that obviously holds the kernel lock.

	If you're not lucky, you get a heartbeat panic because some
	softint tried to take the kernel lock and waited too long for
	it, and then the crash dump fails because suspendsched has
	KASSERT(!cpu_intr_p()) and the heartbeat panic happens within
	an interrupt handler, as in:

	https://mail-index.NetBSD.org/current-users/2025/12/27/msg047183.html

	If you have enabled LOCKDEBUG, and the kernel lock is held for
	more than 10 seconds, you get a kernel lock spinout and an IPI
	is sent to the hogging CPU to get a stack trace.  But since
	autoconf(9) runs kernel-locked, loading a module for a driver
	can trigger this panic.
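
	As a rough illustration of the spinout idea (not the NetBSD
	code), here is a single-threaded userland toy model: a waiter
	counts failed attempts, and once it crosses a threshold it asks
	the holding CPU for a stack trace exactly once.  Every name
	here (model_cpu, model_lock, model_request_trace,
	model_lock_spin) is hypothetical, the "IPI" is just a counter
	bump, and the threshold is counted in iterations rather than
	seconds.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only; none of these names exist in the NetBSD kernel. */
struct model_cpu {
	int id;
	int trace_requests;	/* times this CPU was asked for a stack trace */
};

struct model_lock {
	struct model_cpu *holder;	/* NULL when the lock is free */
};

/* Stand-in for sending an IPI that makes the target print a backtrace. */
static void
model_request_trace(struct model_cpu *target)
{
	target->trace_requests++;
}

/*
 * Try to take the lock.  After `spinout' failed attempts, ask the
 * holder for a trace (once).  After `give_up' failed attempts, stop
 * trying.  Returns 1 if the lock was acquired, 0 otherwise.
 */
static int
model_lock_spin(struct model_lock *l, struct model_cpu *self,
    unsigned long spinout, unsigned long give_up)
{
	unsigned long spins;

	assert(self != NULL);
	for (spins = 0; spins < give_up; spins++) {
		if (l->holder == NULL) {
			l->holder = self;
			return 1;
		}
		if (spins + 1 == spinout)
			model_request_trace(l->holder);
	}
	return 0;
}
```

	In this model the trace request is independent of any debug
	option, and the waiter can keep spinning after requesting the
	trace instead of panicking, which is the behaviour one would
	want during a long but legitimate autoconf(9) attach.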

	It's also annoying when the kernel lock is held long enough to
	make the system flaky (partly because, e.g., wscons(4) and
	pckbport(4) run with it held, and so do some network drivers
	like iwm(4)), but not long enough to trigger other diagnostics.
	However, attempts to dtrace the kernel_lock function, along the
	lines of
	https://mail-index.netbsd.org/tech-kern/2022/10/30/msg028499.html,
	only show that it was taken in sleepq_block because something
	that held the kernel lock slept and then woke up again.

>How-To-Repeat:

	- chase bad interactive system latency due to kernel lock hogs
	- try to diagnose panics like
	  https://mail-index.netbsd.org/current-users/2025/12/27/msg047183.html

>Fix:

	1. Enable the logic to provoke an IPI to dump a stack trace
	   _without_ LOCKDEBUG.

	2. Pass a cookie across the unlock/sleep/relock logic so that
	   dtrace can tell on whose behalf the relock happened.
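
	A minimal userland sketch of idea 2, under loud assumptions:
	this is a toy model, not kernel code, and every name in it
	(toy_biglock, toy_cookie, toy_lock, toy_release_all,
	toy_relock, the toy_probe_* variables) is hypothetical.  It
	only shows the shape of the idea: the release before sleeping
	returns a cookie naming the code path that originally took the
	lock, and the relock after wakeup hands that cookie to whatever
	a tracing probe would observe.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NOBODY	(-1)

/* Toy stand-in for the kernel lock; not the NetBSD data structure. */
struct toy_biglock {
	int holder;		/* thread id of the holder, or TOY_NOBODY */
	int depth;		/* recursive hold count */
	const char *origin;	/* code path that first took the lock */
};

/* Hypothetical cookie carried across the sleep. */
struct toy_cookie {
	int depth;		/* hold count to restore on wakeup */
	const char *origin;	/* code path the holds are charged to */
};

/* What a tracing probe on acquisition would see (hypothetical). */
static const char *toy_probe_caller;
static const char *toy_probe_origin;

/* Take one hold; a fresh acquisition is charged to the caller. */
static void
toy_lock(struct toy_biglock *l, int tid, const char *caller)
{
	assert(l->holder == TOY_NOBODY || l->holder == tid);
	if (l->holder == TOY_NOBODY)
		l->origin = caller;
	l->holder = tid;
	l->depth++;
	toy_probe_caller = caller;
	toy_probe_origin = l->origin;
}

/* Drop every hold before sleeping; remember who they belonged to. */
static struct toy_cookie
toy_release_all(struct toy_biglock *l, int tid)
{
	struct toy_cookie c;

	assert(l->holder == tid && l->depth > 0);
	c.depth = l->depth;
	c.origin = l->origin;
	l->holder = TOY_NOBODY;
	l->depth = 0;
	l->origin = NULL;
	return c;
}

/* Retake the holds after wakeup, reporting the original code path. */
static void
toy_relock(struct toy_biglock *l, int tid, const char *caller,
    struct toy_cookie c)
{
	assert(l->holder == TOY_NOBODY);
	l->holder = tid;
	l->depth = c.depth;
	l->origin = c.origin;
	toy_probe_caller = caller;
	toy_probe_origin = c.origin;
}
```

	With this shape, a relock issued from the sleep path still
	reports the original code path as the origin, so a trace would
	no longer attribute every reacquisition to sleepq_block alone.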

	Other ideas welcome!


