An idea for a feature similar to KMEM_GUARD (which I recently removed because
it was too weak to be useful), but this time at the pool layer, covering
certain specific pools, with no memory consumption or performance cost, and
enabled by default at least on amd64. Note that this is hardening and exploit
mitigation, not bug detection, so it will be of little interest in the context
of fuzzing. Note also that it targets 64-bit arches, because they have nearly
unlimited VA.
The idea is that we can use special guard allocators on certain pools to
prevent important kernel data from sitting close to untrusted data in the VA
space. Suppose the kernel is parsing a packet received from the network, and
there is a buffer overflow which causes it to write beyond the mbuf. The data
is in an mbuf cluster of size 2K (on amd64). This mbuf cluster sits on a 4K
page allocated with the default pool allocator. Right after that 4K page in
memory, critical kernel data could be sitting, which an attacker could
overwrite.
overflow
--------------------------->
+------------+------------+----------------------+
| 2K Cluster | 2K Cluster | Critical Kernel Data |
+------------+------------+----------------------+
<- usual 4K pool page --> <- another 4K page -->
This is a scenario I have already encountered while working on NetBSD's
network stack.
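To make the failure mode concrete, here is a hypothetical sketch of such a
parsing bug; the function, the names and the missing check are made up for
illustration:

	#include <sys/param.h>
	#include <sys/systm.h>

	#define CLUSTER_SIZE	2048	/* MCLBYTES on amd64 */

	void
	parse_packet(char *cluster, const char *pkt, uint16_t len)
	{
		/*
		 * Missing "len <= CLUSTER_SIZE" check: a crafted length
		 * makes the copy run off the 2K cluster, across the rest
		 * of the 4K pool page, and into whatever the default
		 * allocator placed next to it in kernel VA.
		 */
		memcpy(cluster, pkt, len);
	}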
Now, we switch the mcl pool to use the new uvm_km_guard API (simple wrappers
to allocate buffers with unmapped pages at the beginning and the end). The pool
layer sees pages of size 128K, and packs 64 2K clusters in them.
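For illustration, here is a rough sketch of what such a wrapper could look
like, built on the standard uvm_km_alloc()/pmap_kenter_pa() primitives. The
GUARD_SIZE constant and the exact signature are my assumptions for this
sketch; the real code is in the diff referenced at the end.

	#include <sys/param.h>
	#include <uvm/uvm.h>

	#define GUARD_SIZE	(64 * 1024)	/* 64K guard on each side */

	vaddr_t
	uvm_km_guard_alloc(struct vm_map *map, vsize_t size)
	{
		vaddr_t va;
		vsize_t off;
		struct vm_page *pg;

		/* Reserve VA for both guards plus the buffer, without
		 * any physical backing. */
		va = uvm_km_alloc(map, GUARD_SIZE + size + GUARD_SIZE, 0,
		    UVM_KMF_VAONLY);
		if (va == 0)
			return 0;

		/* Back only the middle with physical pages; the guards
		 * stay unmapped, so touching them faults. */
		for (off = 0; off < size; off += PAGE_SIZE) {
			pg = uvm_pagealloc(NULL, 0, NULL, 0);
			KASSERT(pg != NULL); /* real code must handle this */
			pmap_kenter_pa(va + GUARD_SIZE + off,
			    VM_PAGE_TO_PHYS(pg),
			    VM_PROT_READ | VM_PROT_WRITE, 0);
		}
		pmap_update(pmap_kernel());

		/* The caller sees only the mapped middle; a matching
		 * uvm_km_guard_free() undoes all of the above. */
		return va + GUARD_SIZE;
	}

With the mcl pool switched to such an allocator, the overflow scenario now
looks like this: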
overflow
------------~~~~~>
+------------+-------+-------+-------+-------+-------+------------+
| Unmapped | 2K C. | 2K C. | [...] | 2K C. | 2K C. | Unmapped |
+------------+-------+-------+-------+-------+-------+------------+
<-- 64K ---> <-- 128K pool page with 64 clusters --> <-- 64K --->
The pool page header is off-page, and bitmapped. Therefore, there is strictly
no kernel data in the 128K pool page.
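Hooking this into the pool layer is then just a matter of providing a custom
pool allocator. A sketch, modeled on NetBSD's struct pool_allocator; the
guard functions are the assumed wrappers from above, with a hypothetical
uvm_km_guard_free() as the counterpart:

	#include <sys/pool.h>

	static void *
	mclpool_guard_alloc(struct pool *pp, int flags)
	{
		/* 128K pool pages, each surrounded by 64K unmapped
		 * guard areas. */
		return (void *)uvm_km_guard_alloc(kernel_map, 128 * 1024);
	}

	static void
	mclpool_guard_free(struct pool *pp, void *v)
	{
		uvm_km_guard_free(kernel_map, (vaddr_t)v, 128 * 1024);
	}

	static struct pool_allocator mclpool_guard_allocator = {
		.pa_alloc = mclpool_guard_alloc,
		.pa_free = mclpool_guard_free,
		.pa_pagesz = 128 * 1024, /* pool layer sees 128K pages */
	};

Initializing the pool with PR_NOTOUCH in addition keeps the pool from writing
freelist data into free items, which is what makes the bitmapped, off-page
header possible.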
The overflow still occurs, but this time the critical kernel data is far
away, beyond the unmapped pages at the end. At worst only other clusters get
overwritten; at best we are close to the end and hit a page fault which stops
the overflow. 64K is chosen as the guard size because it covers the full
range of a uint16_t: an out-of-bounds offset or length that fits in 16 bits
cannot reach past the guard.
No performance cost, because these guarded buffers are allocated only when
the pools grow, which is a rare operation that occurs almost exclusively at
boot time. No actual memory consumption either, because unmapped areas
consume no physical memory, only virtual, and on 64-bit arches we have plenty
of that (e.g. 32TB on amd64, far beyond what we will ever need), so consuming
VA is not a problem.
The code is here [1] for mcl; it is simple and works fine. It is not perfect,
but it can already prevent a lot of trouble. The same principle could be
applied to other pools.
[1] https://m00nbsd.net/garbage/pool/guard.diff