NetBSD-Bugs archive
kern/59411: deadlock on mbuf pool
>Number: 59411
>Category: kern
>Synopsis: deadlock on mbuf pool
>Confidential: no
>Severity: serious
>Priority: medium
>Responsible: kern-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Fri May 09 15:15:00 +0000 2025
>Originator: Manuel Bouyer
>Release: NetBSD 10.1_STABLE
>Organization:
LIP6
>Environment:
System: NetBSD ftp.lip6.fr 10.1_STABLE NetBSD 10.1_STABLE (FTP10) #8: Wed May 7 13:24:58 CEST 2025 bouyer%armandeche.soc.lip6.fr@localhost:/local/armandeche1/tmp/build/amd64/obj/local/armandeche2/netbsd-10/src/sys/arch/amd64/compile/FTP10 amd64
Architecture: x86_64
Machine: amd64
>Description:
On this heavily loaded web server I got what looks like a deadlock
in the mbuf pool. I could enter ddb; the relevant stack traces are:
Stopped in pid 0.3 (system) at netbsd:breakpoint+0x5: leave
breakpoint() at netbsd:breakpoint+0x5
comintr() at netbsd:comintr+0x7e0
intr_wrapper() at netbsd:intr_wrapper+0x4b
Xhandle_ioapic_edge2() at netbsd:Xhandle_ioapic_edge2+0x6f
--- interrupt ---
mutex_vector_enter() at netbsd:mutex_vector_enter+0x3f0
pool_get() at netbsd:pool_get+0x69
pool_cache_get_slow() at netbsd:pool_cache_get_slow+0x139
pool_cache_get_paddr() at netbsd:pool_cache_get_paddr+0x233
m_get() at netbsd:m_get+0x37
m_copy_internal() at netbsd:m_copy_internal+0x13e
tcp4_segment() at netbsd:tcp4_segment+0x1f9
ip_tso_output() at netbsd:ip_tso_output+0x24
ip_output() at netbsd:ip_output+0x18c4
tcp_output() at netbsd:tcp_output+0x165e
tcp_input() at netbsd:tcp_input+0xfd5
ipintr() at netbsd:ipintr+0x8f1
softint_dispatch() at netbsd:softint_dispatch+0x11c
db{0}> mach cpu 2
using CPU 2
db{0}> tr
_kernel_lock() at netbsd:_kernel_lock+0xd5
mb_drain() at netbsd:mb_drain+0x17
pool_grow() at netbsd:pool_grow+0x3b9
pool_get() at netbsd:pool_get+0x3c7
pool_cache_get_slow() at netbsd:pool_cache_get_slow+0x139
pool_cache_get_paddr() at netbsd:pool_cache_get_paddr+0x233
m_get() at netbsd:m_get+0x37
m_gethdr() at netbsd:m_gethdr+0x9
sosend() at netbsd:sosend+0x3d4
soo_write() at netbsd:soo_write+0x2f
dofilewrite() at netbsd:dofilewrite+0x80
sys_write() at netbsd:sys_write+0x49
syscall() at netbsd:syscall+0x196
CPU 0 holds the kernel lock and tries to take the pool mutex.
CPU 2 holds the pool mutex and tries to take the kernel lock.
As CPU 2 is spinning on the kernel lock, CPU 0 keeps spinning on the
pool mutex as well instead of sleeping (sleeping would release the
kernel lock and break the cycle).
I also got a similar hang involving m_clget() instead of m_get().
>How-To-Repeat:
I've seen this only on this server, and under heavy load (more than
1500 active TCP connections, 2 gigabit interfaces at full speed).
>Fix:
The patch below seems to fix the issue (I have been running it for 3
days now, without hangs). It doesn't seem to hurt to skip the drain
hook in the !PR_WAITOK case; it will be called asynchronously on RAM
shortage anyway.
Alternatively we could return early in mb_drain() in the !PR_WAITOK
case, and audit other pool drain hook usages.
Index: kern/subr_pool.c
===================================================================
RCS file: /cvsroot/src/sys/kern/subr_pool.c,v
retrieving revision 1.285.4.2
diff -u -p -u -r1.285.4.2 subr_pool.c
--- kern/subr_pool.c 15 Dec 2024 14:58:45 -0000 1.285.4.2
+++ kern/subr_pool.c 9 May 2025 14:47:30 -0000
@@ -2967,6 +2957,7 @@ pool_allocator_alloc(struct pool *pp, in
void *res;
res = (*pa->pa_alloc)(pp, flags);
+#if 0
if (res == NULL && (flags & PR_WAITOK) == 0) {
/*
* We only run the drain hook here if PR_NOWAIT.
@@ -2978,6 +2969,7 @@ pool_allocator_alloc(struct pool *pp, in
res = (*pa->pa_alloc)(pp, flags);
}
}
+#endif
return res;
}