Hunting kernel lock and interrupt latency

To: tech-kern%NetBSD.org@localhost
Subject: Hunting kernel lock and interrupt latency
From: Taylor R Campbell <campbell+netbsd-tech-kern%mumble.net@localhost>
Date: Sun, 30 Oct 2022 12:30:38 +0000
Too much of the kernel still runs with the kernel lock, which is bad
for performance -- and especially for interactive system
responsiveness, since most of wscons including keyboard and mouse
input still use the kernel lock.  On single-CPU systems, the same
problem arises with interrupt latency on its own.

So I've been trying to track down kernel lock abusers and long-running
interrupt handlers.  I recently added some dtrace probes to help find
these.  Here are some methods I've used:

# Show what caller was hogging the kernel lock, at the time they
# unlock it, with the maximum duration held in nanoseconds.  (This
# doesn't count time sleeping, during which the kernel lock is
# released.)  Worst offenders on my laptop recently, holding the
# kernel lock for more than half a millisecond:
#
#	netbsd`fileassoc_file_delete+0xbe 549281
#	netbsd`tap_fops_write+0x31 741863
#	netbsd`VOP_UNLOCK+0x9b 1108718
#	netbsd`cdev_write+0x96 1307875
#	netbsd`VOP_LOCK+0xee 1321592
#	netbsd`softint_dispatch+0x348 137520209
#	netbsd`intr_biglock_wrapper+0x58 459787274
#
# fileassoc_file_delete should not be here because I'm not using
# veriexec(4).  A quick skim of kern_fileassoc.c shows that its
# locking is broken and it takes the kernel lock in a critical path
# for unlink(2) where it shouldn't be taking _any_ locks because it's
# not even being used.  The vops, cdev, softint, and intr cases all
# need a little more investigation to see who's got this much latency!
#
# (This sample might be slightly skewed because I sampled
# fbt::_kernel_lock:entry, so it included time waiting for the kernel
# lock; the following script uses fbt::_kernel_lock:return to avoid
# this.)
#
dtrace -b 32m -x cleanrate=50hz -n '
	fbt::_kernel_lock:return {
		self->starttime = timestamp
	}
	fbt::_kernel_unlock:entry /self->starttime/ {
		@[caller] = max(timestamp - self->starttime);
		self->starttime = 0
	}
	tick-1s, END {
		printa("\n%a %@u", @)
	}
'

# Show which file systems have vnodes without VV_MPSAFE (0x10) that
# we're locking and how often.  On my laptop this is all ptyfs:
#
#	  ptyfs                                                          2586
#
# The other file systems I'm using -- ffs, nullfs, procfs, tmpfs, and
# (mostly!) zfs -- are all MP-safe already.  ptyfs may not be a
# _throughput_ bottleneck in most cases (except when dumping large
# quantities of command output to a pty), but it may be important for
# interactive latency in terminal emulators.
#
# We can also change this to aggregate with @[stack()] instead of
# @[stringof ...f_fstypename] to get a histogram of stack traces where
# this is happening, which shows mostly ptyfs_write and ptyfs_read and
# their callers on my laptop.
#
dtrace -n '
	fbt::VOP_LOCK:entry /(args[0]->v_vflag & 0x10) == 0/ {
		@[stringof args[0]->v_mount->mnt_stat.f_fstypename] = count()
	}
	tick-10s {
		printa(@)
	}
'

# On x86, show a histogram of interrupt latency for interrupt handlers
# that wait for and take the kernel lock.  For example, this showed:
#
#	netbsd`AcpiEvSciXruptHandler,
#	           value  ------------- Distribution ------------- count
#	           32768 |                                         0
#	           65536 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@             117125
#	          131072 |@@@@@@@@@@@@                             49904
#	          262144 |@                                        3080
#	          524288 |                                         0
#	
# It turns out that the ACPI interrupt handlers all have their own
# locking anyway, so I marked them MP-safe on x86!  I also did the
# same for ehci(4) and xhci(4) USB host controller interfaces after a
# quick audit.
#
# This also turned up pckbcintr (PS/2 keyboard/mouse driver for my
# laptop's internal keyboard and trackpad) and gen8_irq_handler (i915
# drm graphics interrupts), with latency sometimes between a quarter
# and half a second(!) during builds.  (Note that this includes time
# waiting for the kernel lock, so it's not necessarily the interrupt
# handler's fault directly that the latency rose to half a second.)
#
# We're experimenting in HEAD now with removing the kernel lock on drm
# pci interrupts (since the drm drivers all have their own internal
# locking anyway), but pckbcintr will require more work to make
# pckbport and wskbd/wsmouse/wsevent MP-safe.
#
dtrace -b 32m -x cleanrate=50hz -n '
	fbt::intr_biglock_wrapper:entry {
		self->starttime = timestamp;
		self->func = ((struct intrhand *)arg0)->ih_realfun;
	}
	fbt::intr_biglock_wrapper:return /self->starttime/ {
		@[self->func] = quantize(timestamp - self->starttime);
		self->starttime = 0;
		self->func = 0
	}
	tick-1s, END {
		printa("\n%a, %@u", @)
	}
'

# Show latency of softints and callouts with new SDT probes I added
# yesterday.  Haven't run this on my laptop yet, but in an aarch64 VM
# here's what it shows after a few seconds:
#
#	netbsd`gpiokeys_tick 92336
#	netbsd`rt_timer_timer 100272
#	netbsd`if_slowtimo_intr 107248
#	netbsd`key_timehandler 116816
#	netbsd`nd6_timer 170064
#	netbsd`sleepq_timeout 206240
#	netbsd`ttysigintr 268064
#	netbsd`callout_softclock 795520
#	netbsd`plcomsoft 9116672
#
# plcomsoft is the serial console softint, which runs kernel-locked,
# and it sometimes takes almost 10ms!  It may not be holding the
# kernel lock for all 10ms in case it sleeps for an adaptive mutex
# during part of that, but if it sleeps it holds up all other softints
# -- and thus also callouts -- on the same CPU.
#
dtrace -n '
	sdt:kernel:softint:entry, sdt:kernel:callout:entry {
		self->starttime = timestamp
	}
	sdt:kernel:softint:return, sdt:kernel:callout:return
			/self->starttime/ {
		@[arg1] = max(timestamp - self->starttime);
		self->starttime = 0
	}
	tick-1s, END {
		printa("\n%a %@u", @)
	}
'

# x86 and aarch64 also now have similar SDT probes for hard interrupt
# handlers, sdt:kernel:intr:entry/return.  Note: for the hard
# interrupt handler SDT probes, the interrupt handler function is
# arg0, not arg1 (maybe I should change these so they agree for hard
# intr (arg0), softint (arg1), callout (arg1), and workqueue (arg2)).

From this, it looks like the next things to prioritize making MP-safe
for better system responsiveness, at least on my laptop, are:

1. pckbcintr, including pckbport/wskbd/wsmouse/wsevent
2. plcom(4) and com(4)
3. ptyfs
4. tap(4)
5. get rid of fileassoc_file_delete and/or fix the locking

Surprisingly, nothing i915drm-related turned up in here even though
all the drm ioctls are kernel-locked -- the worst offender that
_might_ have been i915 was

netbsd`cdev_ioctl+0xad 166316

which isn't bad at all (166us waiting for or holding kernel lock).

iwm_intr also turned up with up to half a second of latency, but
that'll have to be deferred until the wifi update that martin's
working on.

But maybe on your system, with the help of these dtrace fragments,
you'll find other culprits of kernel lock or interrupt latency abuse!
Follow-Ups:
- Re: Hunting kernel lock and interrupt latency
  - From: nia
Prev by Date: malloc(9) vs kmem(9) interfaces
Next by Date: Re: Restructuring inpcb/in6pcb
Previous by Thread: malloc(9) vs kmem(9) interfaces
Next by Thread: Re: Hunting kernel lock and interrupt latency
Indexes:
Home | Main Index | Thread Index | Old Index