NetBSD-Bugs archive
kern/60189: physio() is unnecessarily brittle
>Number: 60189
>Category: kern
>Synopsis: physio() is unnecessarily brittle
>Confidential: no
>Severity: non-critical
>Priority: medium
>Responsible: kern-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Mon Apr 13 04:25:00 +0000 2026
>Originator: Jason Thorpe
>Release: NetBSD 11 (and 10, and 9, and ...)
>Organization:
The Society for Aspiring Unix Grey-beards
>Environment:
>Description:
The physio() facility for doing disk I/O directly to user buffers works
by looping over the uio encapsulating the user's I/O request and iteratively
using uvm_vslock() (to wire the user's pages) and vmapbuf() (to map those
pages into kernel space), performing the I/O, followed by vunmapbuf() and
uvm_vsunlock() in physio_done(). It's been done this way in one form or
another practically forever and, for the most part, works pretty well.
However, there is an inefficiency built into this mechanism that also
makes it less robust than it could be. Specifically, it requires that,
in addition to the backing page, the user's *mapping* of that page
be wired so that vmapbuf() can extract the PA of the page and create
a kernel mapping for it. While the implementation of vmapbuf() is
considered to be architecture-dependent, all implementations are built around
a pmap_extract() (to get the PA from the user's pmap) -> pmap_enter()
(to enter the kernel mapping for that PA) loop, and if the pmap_extract()
fails, vmapbuf() panics.
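To make the failure mode concrete, here is a minimal userland sketch of that
extract-then-enter loop. It is a toy model, not any port's actual vmapbuf():
struct toy_pmap, toy_pmap_extract() and toy_vmapbuf() are hypothetical names
invented for illustration, and the early return marks the spot where a real
vmapbuf() would panic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_NPAGES 8	/* size of the toy user address space, in pages */

/* Hypothetical stand-in for a user pmap: one slot per virtual page. */
struct toy_pmap {
	bool      valid[TOY_NPAGES];	/* is a translation currently loaded? */
	uintptr_t pa[TOY_NPAGES];	/* physical address, if valid */
};

/* Same shape as pmap_extract(): returns true and sets *pap on success. */
static bool
toy_pmap_extract(const struct toy_pmap *pm, size_t vpn, uintptr_t *pap)
{
	if (vpn >= TOY_NPAGES || !pm->valid[vpn])
		return false;
	*pap = pm->pa[vpn];
	return true;
}

/*
 * Toy version of the vmapbuf() loop: for each of npages user pages
 * starting at page uvpn, look up the PA in the user pmap and record it
 * in the kernel window kmap[].  Returns the number of pages mapped.
 */
static size_t
toy_vmapbuf(const struct toy_pmap *upmap, size_t uvpn, size_t npages,
    uintptr_t *kmap)
{
	size_t i;

	assert(upmap != NULL && kmap != NULL);
	for (i = 0; i < npages; i++) {
		uintptr_t pa;

		if (!toy_pmap_extract(upmap, uvpn + i, &pa))
			break;		/* a real vmapbuf() panics here */
		kmap[i] = pa;		/* stands in for pmap_enter() */
	}
	return i;
}
```

In this model, clearing a single valid[] entry between the wiring step and the
call to toy_vmapbuf() is enough to make the loop stop short, even though the
page behind that entry has not gone anywhere.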
This is the brittleness -- because the mapping is supposed to be wired,
the expectation is that pmap_extract() will not fail. However, there is
a practical problem: it's not always *possible* to have truly wired
user space mappings even if the pages backing those mappings are themselves
wired.
Consider a system like the Sun2: the MMU has a fixed number of mapping
resources (contexts and PMEGs) that are baked into the hardware and thus
*must* be shared by all processes. The kernel gets its own context and
its own PMEGs, but user processes *must* fight for what remains. As such,
the pmap cannot honor wired mapping requests for user pmaps, because doing so
would make it way too easy to bring the system to a screeching halt. The
same situation exists in the Sun3 pmap, and comments in both allude to the
authors of those modules being aware that this could be a problem. There
are ways to mitigate this (e.g. use the wired-ness of mappings as a weighting
factor when selecting a PMEG to steal), but this isn't really a solution;
ultimately, wired user mappings are at best a hint and the resources *must*
be shared, and it's always going to be possible to construct a situation
where a wired mapping is going to get nuked under resource pressure.
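The weighting idea can be sketched as a victim-selection routine. This is a
toy model only: struct toy_pmeg and toy_pmeg_pick_victim() are hypothetical
and do not reflect the actual sun2 or sun3 pmap data structures, and the tie
break on age is an assumption made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-PMEG bookkeeping, invented for this example. */
struct toy_pmeg {
	unsigned wired;	/* number of wired mappings held in this PMEG */
	unsigned age;	/* larger means less recently used */
};

/*
 * Choose a PMEG to steal: prefer the one with the fewest wired
 * mappings, breaking ties by picking the oldest.  Note that it always
 * returns a victim, even when every PMEG holds wired mappings.
 */
static size_t
toy_pmeg_pick_victim(const struct toy_pmeg *pmegs, size_t n)
{
	size_t best = 0;

	assert(pmegs != NULL && n > 0);
	for (size_t i = 1; i < n; i++) {
		if (pmegs[i].wired < pmegs[best].wired ||
		    (pmegs[i].wired == pmegs[best].wired &&
		     pmegs[i].age > pmegs[best].age))
			best = i;
	}
	return best;
}
```

The last point in the comment is the one that matters: weighting makes it less
likely that a wired mapping is evicted, but once every candidate holds one, the
routine still has to hand something back.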
What's silly about this is that we don't really need the pmap layer to
preserve the PA of the user's physical pages (which themselves remain
wired even if the mapping doesn't) at all, because UVM already knows what
they are, and in fact, in the process of performing uvm_vslock()'s work,
has visited the page structure for each of them to increment the wire
count. The physical addresses of the pages are **right there**, and there
is no need to ask the pmap layer to preserve them other than lack of
sufficient plumbing.
I recently encountered this situation with the Sun2. A prior un-clean
shutdown resulted in an fsck having to be performed, and for whatever
reason (perhaps the installer just set up fstab this way?), multiple fscks
were allowed to run in parallel after the root fs was checked. This
put the system under mapping resource pressure, and because each fsck
was performing physio, the bomb was armed and at some point, a PMEG that
held a wired-for-physio mapping had to get forcefully recycled (possibly
to wire pages for another fsck) and vmapbuf() blew up as a result.
It was possible to address this in the moment by booting to single user,
fsck'ing everything manually, and then editing fstab to remove any
fsck parallelism.
However, the vulnerability remains, and this report serves to document it
and anchor a discussion about how to address it.
>How-To-Repeat:
Parallel fscks on a Sun2 or Sun3 with a sufficient number of disks / partitions will trigger this problem pretty easily.
>Fix:
Suggested solution: a form of uvm_fault_wire() that can enter mappings for the user pages into an already-allocated physio address range as the pages are wired (the PAs of those pages are readily available at that point, without having to ask the pmap to preserve them).
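As a rough illustration of the shape such an interface might take, here is a
userland toy model that wires pages and fills in the kernel window in the same
pass. Everything in it is hypothetical (struct toy_page, toy_wire_and_map(),
toy_unwire_and_unmap()); it is not a proposed patch and does not use the real
UVM or pmap interfaces. The point is only that no user pmap lookup appears
anywhere in the loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a VM page structure. */
struct toy_page {
	uintptr_t phys_addr;	/* PA, already known to the VM system */
	unsigned  wire_count;
};

/*
 * Wire npages pages and, while each page structure is in hand, record
 * its PA in the pre-allocated kernel window kmap[].
 */
static void
toy_wire_and_map(struct toy_page **pages, size_t npages, uintptr_t *kmap)
{
	for (size_t i = 0; i < npages; i++) {
		pages[i]->wire_count++;
		kmap[i] = pages[i]->phys_addr;	/* stands in for a kernel pmap_enter() */
	}
}

/* Undo toy_wire_and_map(): clear the window and drop the wirings. */
static void
toy_unwire_and_unmap(struct toy_page **pages, size_t npages, uintptr_t *kmap)
{
	for (size_t i = 0; i < npages; i++) {
		assert(pages[i]->wire_count > 0);
		kmap[i] = 0;
		pages[i]->wire_count--;
	}
}
```

Because the kernel window is populated from the page structures, the state of
the user's own mapping (present, stolen, or never entered) has no bearing on
whether the I/O can proceed in this model.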