Re: kern/59411 (deadlock on mbuf pool)
The following reply was made to PR kern/59411; it has been noted by GNATS.
From: Taylor R Campbell <riastradh@NetBSD.org>
To: Manuel Bouyer <bouyer@antioche.eu.org>
Cc: gnats-bugs@netbsd.org, kern-bug-people@netbsd.org,
 netbsd-bugs@netbsd.org, gnats-admin@netbsd.org,
 mrg@NetBSD.org, chs@NetBSD.org, christos@NetBSD.org, tnn@NetBSD.org,
 wiz@NetBSD.org
Subject: Re: kern/59411 (deadlock on mbuf pool)
Date: Mon, 19 May 2025 13:00:29 +0000
> Date: Mon, 19 May 2025 10:17:57 +0200
> From: Manuel Bouyer <bouyer@antioche.eu.org>
>
> Back to the initial patch in this PR.
>
> I guess running the whole pool_grow() at splvm(), including the busy-wait
> on PR_GROWINGNOWAIT, would work (then we could busy-wait even when called
> from interrupt context), but I'm not sure it's better than my initial patch.
Unfortunately, this won't work, because mutex_exit will restore the spl
to what it was at mutex_enter time (unless the spl was already raised
when mutex_enter ran because another spin lock was held), even if you
try something like:
+ s = splraiseipl(pp->pr_ipl);
mutex_exit(&pp->pr_lock);
storage = pool_allocator_alloc(pp, flags);
...
mutex_enter(&pp->pr_lock);
+ splx(s);
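To spell out why, here is a rough sketch of the IPL bookkeeping for that
sequence (from memory and simplified, assuming pr_lock is a spin mutex
at the pool's IPL and no other spin lock is held):

    /* before: spl == s0, some level below pr_ipl */
    mutex_enter(&pp->pr_lock);    /* raises spl to pr_ipl, saves s0 */
    ...
    s = splraiseipl(pp->pr_ipl);  /* no-op: spl is already pr_ipl */
    mutex_exit(&pp->pr_lock);     /* restores the saved s0, silently
                                     undoing the raise above */
    storage = pool_allocator_alloc(pp, flags);  /* runs at s0 */
    ...
    mutex_enter(&pp->pr_lock);    /* raises spl to pr_ipl again */
    splx(s);                      /* s == pr_ipl here, so this has no
                                     effect */

So the backing allocator still runs at s0, not at the pool's IPL.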
> My patch will return failure for more cases, but the caller should be able
> to deal with that anyway.
I think your initial patch (assuming you mean 1.293, and then just
deleting the whole PR_GROWINGNOWAIT machinery) is likely to be asking
for trouble by continuing to hold the lock across the backing
allocator -- and, perhaps worse, across the backing allocator's free
routine, which sometimes issues a cross-call that requires all other
CPUs to be responsive.
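To illustrate the concern, here is one hypothetical shape of such a
deadlock (not a trace from this PR, and assuming the cross-call is a
low-priority one serviced by per-CPU threads):

    /*
     * CPU A                              CPU B
     * mutex_enter(&pp->pr_lock)
     * pool_allocator_free(pp, ...)       mutex_enter(&pp->pr_lock)
     *   unmaps the kva and issues a        spins at the pool's IPL, so
     *   cross-call, then waits for         the cross-call thread never
     *   every other CPU to run it          gets to run on this CPU
     *
     * A waits for B to service the cross-call, B waits for A to drop
     * pr_lock, and neither makes progress.
     */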
The PR_GROWINGNOWAIT change addressed a real problem _without_ holding
the lock across the backing allocator. Unfortunately, I don't
remember how to provoke the problem, or where there is a record of the
symptoms; perhaps wiz@ does?
Based on the commit message, it sounds like the system would sometimes
fail to boot up, with a crash in init, because of concurrent pool
growth despite having lots of memory available.
This presumably _started_ to happen after I made sure only one thread
at a time could do pool_grow, in order to avoid the situation where
multiple threads doing simultaneous allocation lead to multiple
backing pages with only one object apiece, severely fragmenting kva:
PR kern/45718: processes sometimes get stuck and spin in vm_map
https://mail-index.NetBSD.org/tech-kern/2017/10/23/msg022472.html
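For reference, the serialization that resulted has roughly this shape (a
from-memory sketch of the pattern only, not the actual subr_pool.c code;
error handling and the page-insertion path are elided):

    static int
    pool_grow(struct pool *pp, int flags)
    {
            char *storage;

            /* Only one thread grows the pool at a time. */
            while (pp->pr_flags & PR_GROWING) {
                    if (flags & PR_WAITOK) {
                            /* Sleep until the current grower is done;
                               it should have added a page, so let the
                               caller just retry its allocation. */
                            cv_wait(&pp->pr_cv, &pp->pr_lock);
                            return 0;
                    }
                    if (pp->pr_flags & PR_GROWINGNOWAIT) {
                            /* The grower can't sleep either, so it will
                               finish soon: busy-wait by dropping and
                               retaking the lock to let it run (this is
                               the part that has become a grody hack). */
                            mutex_exit(&pp->pr_lock);
                            mutex_enter(&pp->pr_lock);
                            continue;
                    }
                    /* The grower may sleep indefinitely; give up. */
                    return EWOULDBLOCK;
            }
            pp->pr_flags |= PR_GROWING;
            if ((flags & PR_WAITOK) == 0)
                    pp->pr_flags |= PR_GROWINGNOWAIT;

            /* The lock is NOT held across the backing allocator. */
            mutex_exit(&pp->pr_lock);
            storage = pool_allocator_alloc(pp, flags);
            ...
            mutex_enter(&pp->pr_lock);
            pp->pr_flags &= ~(PR_GROWING | PR_GROWINGNOWAIT);
            cv_broadcast(&pp->pr_cv);
            return storage == NULL ? ENOMEM : 0;
    }

The actual code also restricts when the busy-wait may be used, in
particular from interrupt context, which is what the splvm() suggestion
quoted above was trying to relax.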
Now, the PR_GROWINGNOWAIT mechanism is certainly growing into an
increasingly grody hack, but that means we need to revisit the design
carefully rather than just blithely create more deadlocks like
christos@ did back in 2017.