tech-kern archive
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]
Re: Likely lock botch in NPF firewall
Taylor R Campbell <campbell+netbsd-tech-kern%mumble.net@localhost> writes:
>> Date: Wed, 11 Jan 2023 13:05:04 -0500
>> From: Brad Spencer <brad%anduin.eldar.org@localhost>
>>
>> I think I know what the root problem is for kern/57136 and
>> kern/57181... a couple of PRs I wrote about problems I have been having
>> with NPF, but I am not at all sure that my solution is correct.
Thanks for the response....
> These are the same issue. Your analysis is correct that this happens
> because the spin lock t_lock is held across copyout, which is
> forbidden, because copyout might sleep indefinitely and sleeping at
> all (let alone indefinitely) under a spin lock is forbidden.
>
> The kernel with LOCKDEBUG just goes out of its way to detect the
> problem early, whereas the non-LOCKDEBUG kernel mostly does the
> copyout without sleeping so it gets by without triggering the panic.
>
> However, simply changing the lock to be IPL_SOFTNET isn't enough.
> Holding an IPL_SOFTNET lock across copyout is also forbidden (but
> LOCKDEBUG might not detect it). It is forbidden for anything in hard
> or soft interrupt context to block for copyout (which might be
> happening in another thread).
I mostly picked IPL_SOFTNET from the statements made in mutex(9). It
seems to state the IPL_SOFTNET is one of the adaptative locks as are all
IPL_SOFT* locks, that is, it isn't a spin lock.
[sorry for what is likely to be dumb questions]
Is what you mean by "forbidden for anything in hard or soft interrupt
context to block for copyout" to mean holding any mutex across copyout??
(This is an area I am pretty dumb about and am not entirely sure what I
am doing). I could not find anything in the man pages about holding
mutexs around copyout. (That is, if it IS NOT an interrupt context can
you hold a mutex during a copyout).
The case that messes with me the most is the listing a table ioctl. If
that ioctl manages to get the IPL_SOFTNET lock, nothing else, even if it
is interrupt context of some sort, should have it and it won't get the
lock until it is available?? (or am I missing something)... or is a
ioctl itself some sort of interrupt context or is it a the case of just
not allowed to hold any sort of mutex across a copyout at all?? (like I
said, don't really understand this that well).
> This code needs to coordinate the table iteration and copyout with
> other access (insert/remove/lookup) in another way. Probably the
> easiest way will be to create an rwlock, and use t_gc for lpm-type
> tables in addition to thmap-type tables:
>
> - npf_table_remove moves lpm-type entries to the t_gc list instead of
> freeing them immediately
> - npf_table_gc takes a write lock and frees the t_gc entries
> - npf_table_list takes a read lock (and pserialize if needed to grab
> pointers to entries) around copyout -- never takes t_lock
>
> This could still potentially result in deadlock scenarios, if
> npf_table_gc is ever reached from the pagedaemon; if so, npf_table_gc
> could do rw_tryenter instead of rw_enter to avoid this deadlock, at
> the cost of sometimes delaying the freeing of table entries.
I understand the code well enough to read it some, but probably not well
enough to implement what you suggest. As it is right now, this is a bad
bug. The abuse test I came up with that doesn't use LOCKDEBUG trips a
panic by hitting two xbd drives pretty hard while doing a npf table list
but I think that other activity will trigger it. I get 2 to 6 days
between panics right now. With the simple patch I could not cause the
abuse panic in my test DOMU that I was able to without it. For me right
now that may be an improvement if it helps the firewall / router not
panic.
--
Brad Spencer - brad%anduin.eldar.org@localhost - KC8VKS - http://anduin.eldar.org
Home |
Main Index |
Thread Index |
Old Index