tech-kern archive


Re: deadlock between KERNEL_LOCK and a mutex ?



> Date: Mon, 5 May 2025 18:08:19 +0200
> From: Manuel Bouyer <bouyer%antioche.eu.org@localhost>
> 
> still trying to debug panics/hangs on a heavily loaded web server

What kernel version?

> I got a hard hang;

What does `hard hang' mean?  Is there a heartbeat panic?  Can
you share the full output of ps, ps/w, and show all tstiles?  And can
you show the stack traces for all CPUs with `mach cpu N'?

> db{0}> mach cpu 2
> using CPU 2
> db{0}> tr
> _kernel_lock() at netbsd:_kernel_lock+0xd5
> mb_drain() at netbsd:mb_drain+0x17    
> pool_grow() at netbsd:pool_grow+0x3b9 
> pool_get() at netbsd:pool_get+0x3c7   
> [...]
> 
> I wonder if we can have a deadlock here: CPU 2 holds mbuf pool's lock and
> tries to get _kernel_lock(). It looks like the softint thread on CPU 0
> holds the kernel_lock (as it's not running with NET_MPSAFE) and tries
> to get the mbuf pool's lock.

This deadlock doesn't make sense because we drop the pool lock around
the drain hook (mb_drain):

   1129 			/*
   1130 			 * Since the drain hook is going to free things
   1131 			 * back to the pool, unlock, call the hook, re-lock,
   1132 			 * and check the hardlimit condition again.
   1133 			 */
   1134 			mutex_exit(&pp->pr_lock);
   1135 			(*pp->pr_drain_hook)(pp->pr_drain_hook_arg, flags);
   1136 			mutex_enter(&pp->pr_lock);
   1137 			if (pp->pr_nout < pp->pr_hardlimit)
   1138 				goto startover;

https://nxr.netbsd.org/xref/src/sys/kern/subr_pool.c?r=1.293#1129
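
To illustrate the unlock / call hook / re-lock pattern outside the
kernel, here is a minimal userland sketch using pthreads.  It is not
the subr_pool.c code: toy_pool, toy_pool_get, toy_pool_put, toy_drain,
big_lock, and hook_saw_lock_free are all made-up names, and big_lock
merely stands in for kernel_lock.  The hook uses
pthread_mutex_trylock to record whether the pool lock was free when
it was called.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy pool; illustrative only, not NetBSD's struct pool. */
struct toy_pool {
	pthread_mutex_t lock;		/* stands in for pp->pr_lock */
	unsigned nout;			/* items currently handed out */
	unsigned hardlimit;		/* max items that may be out */
	void (*drain_hook)(struct toy_pool *);
	bool hook_saw_lock_free;	/* set by the hook, for demonstration */
};

/* Stands in for kernel_lock (assumption for this sketch). */
static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;

/* Return one item to the pool; takes the pool lock itself. */
static void
toy_pool_put(struct toy_pool *pp)
{
	pthread_mutex_lock(&pp->lock);
	assert(pp->nout > 0);
	pp->nout--;
	pthread_mutex_unlock(&pp->lock);
}

/* Drain hook: takes the big lock, then frees one item back to the pool. */
static void
toy_drain(struct toy_pool *pp)
{
	pthread_mutex_lock(&big_lock);
	/* trylock succeeds only if the caller dropped the pool lock. */
	if (pthread_mutex_trylock(&pp->lock) == 0) {
		pp->hook_saw_lock_free = true;
		pthread_mutex_unlock(&pp->lock);
		toy_pool_put(pp);
	}
	pthread_mutex_unlock(&big_lock);
}

/* Take one item; at the hard limit, try the drain hook once. */
static bool
toy_pool_get(struct toy_pool *pp)
{
	bool drained = false;

	pthread_mutex_lock(&pp->lock);
	while (pp->nout >= pp->hardlimit) {
		if (pp->drain_hook == NULL || drained) {
			pthread_mutex_unlock(&pp->lock);
			return false;
		}
		/* Unlock, call the hook, re-lock, then re-check the limit. */
		pthread_mutex_unlock(&pp->lock);
		(*pp->drain_hook)(pp);
		pthread_mutex_lock(&pp->lock);
		drained = true;
	}
	pp->nout++;
	pthread_mutex_unlock(&pp->lock);
	return true;
}
```

With a pool at its hard limit (nout == hardlimit) and toy_drain
installed as the hook, toy_pool_get succeeds and the hook sees the
pool lock free.  The thread blocked on big_lock inside the hook holds
no pool lock, so a big_lock holder that wants the pool lock can still
make progress.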

> Other CPUs are also trying to get the kernel_lock or the mbuf's pool lock.
> Several are in:
> mutex_vector_enter() at netbsd:mutex_vector_enter+0x209
> tcp_timer_rexmt() at netbsd:tcp_timer_rexmt+0x28
> callout_softclock() at netbsd:callout_softclock+0xd2
> softint_dispatch() at netbsd:softint_dispatch+0x11c

At tcp_timer_rexmt+0x28 (which is likely the first call after the
function prologue), I suspect this is waiting for softnet_lock, not
the mbuf pool lock:

    300 void
    301 tcp_timer_rexmt(void *arg)
    302 {
    303 	struct tcpcb *tp = arg;
    304 	uint32_t rto;
    305 #ifdef TCP_DEBUG
    306 	struct socket *so = NULL;
    307 	short ostate;
    308 #endif
    309 
    310 	mutex_enter(softnet_lock);

https://nxr.netbsd.org/xref/src/sys/netinet/tcp_timer.c?r=1.99#310

