tech-kern archive

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]

Likely lock botch in NPF firewall




Hello...

I think I know what the root problem is for kern/57136 and
kern/57181... a couple of PRs I wrote about problems I have been having
with NPF, but I am not at all sure that my solution is correct.

The problem is that it seems that a spin lock is being and then an
action is performed that is illegal while holding a spin lock.  In
kern/57136 a panic happens when a CPU switch occurs (I think), and
kern/57181 (a much simpler panic to reproduce) the panic happens when
LOCKDEBUG is defined and certain NPF operations are performed.  What I
think is happening is that a mutex is created per NPF table held at ipl
IPL_NET (see the use of &t->t_lock in src/sys/net/npf/npf_tableset.c).
This would be a spin lock (I think).  This lock is held during NPF
operation such as "npfctl table <table> list" and that ioctl can perform
a copyout which I believe is against the rules when holding a spin lock
(there may be other copyout instances too, but the table list is the one
that is messing with me).  This seems to be strongly hinted at in the
LOCKDEBUG PR, but I don't fully understand everything I see there.

My simple patch to address this is this:

--- npf_tableset.c.DIST 2022-04-09 19:38:33.000000000 -0400
+++ npf_tableset.c      2023-01-11 09:26:30.627895981 -0500
@@ -410,7 +410,8 @@
        default:
                KASSERT(false);
        }
-       mutex_init(&t->t_lock, MUTEX_DEFAULT, IPL_NET);
+       /* mutex_init(&t->t_lock, MUTEX_DEFAULT, IPL_NET); */
+       mutex_init(&t->t_lock, MUTEX_DEFAULT, IPL_SOFTNET);
        t->t_type = type;
        t->t_id = tid;
        return t;


This patch seems to work alright, the LOCKDEBUG panic is gone addressing
kern/57181 and the abusive test I found which can trigger kern/57136
appears to also be gone (i.e. I don't have to wait 6 days to see the
panic, I have managed to come up with a way to produce it in a few
minutes).  What I do not know is why this mutex was IPL_NET in the first
place (the comments mention that it is so, but not why it is so).  I
compiled a kernel for the firewall that was panicing with this patch and
it appears to still appears to function as a firewall / router just
fine, which suggests that lock need not have been a spin lock in the
first place (at least with Xen DOMUs).

While I am running Xen DOMUs everywhere at least kern/57181 (the
LOCKDEBUG PR) effects everything.  I managed to panic a Virtualbox "bare
metal" system using the GENERIC kernel, for example.  This suggests that
right now the use of NPF with tables is probably a bit dangerous.  This
problem is also probably racing in some way, as I have a couple of other
DOMUs that use NPF as a firewall and use NPF tables that do not panic at
all.  They do exactly the same operations, more or less (with a
different /etc/npf.conf) as the one that does panic regularly.

Any advise or hints about this would be appreciated.  I am honestly a
bit in the dark as to some of what may be going on here.






-- 
Brad Spencer - brad%anduin.eldar.org@localhost - KC8VKS - http://anduin.eldar.org



Home | Main Index | Thread Index | Old Index