NetBSD-Bugs archive


kern/59411: deadlock on mbuf pool



>Number:         59411
>Category:       kern
>Synopsis:       deadlock on mbuf pool
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Fri May 09 15:15:00 +0000 2025
>Originator:     Manuel Bouyer
>Release:        NetBSD 10.1_STABLE
>Organization:
	LIP6
>Environment:
System: NetBSD ftp.lip6.fr 10.1_STABLE NetBSD 10.1_STABLE (FTP10) #8: Wed May  7 13:24:58 CEST 2025  bouyer%armandeche.soc.lip6.fr@localhost:/local/armandeche1/tmp/build/amd64/obj/local/armandeche2/netbsd-10/src/sys/arch/amd64/compile/FTP10 amd64
Architecture: x86_64
Machine: amd64
>Description:
	On this heavily loaded web server I got what looks like a deadlock
	in the mbuf pool. I could enter ddb; the relevant stack traces are:
Stopped in pid 0.3 (system) at  netbsd:breakpoint+0x5:  leave
breakpoint() at netbsd:breakpoint+0x5
comintr() at netbsd:comintr+0x7e0
intr_wrapper() at netbsd:intr_wrapper+0x4b
Xhandle_ioapic_edge2() at netbsd:Xhandle_ioapic_edge2+0x6f
--- interrupt ---                      
mutex_vector_enter() at netbsd:mutex_vector_enter+0x3f0
pool_get() at netbsd:pool_get+0x69
pool_cache_get_slow() at netbsd:pool_cache_get_slow+0x139
pool_cache_get_paddr() at netbsd:pool_cache_get_paddr+0x233
m_get() at netbsd:m_get+0x37           
m_copy_internal() at netbsd:m_copy_internal+0x13e
tcp4_segment() at netbsd:tcp4_segment+0x1f9
ip_tso_output() at netbsd:ip_tso_output+0x24
ip_output() at netbsd:ip_output+0x18c4
tcp_output() at netbsd:tcp_output+0x165e
tcp_input() at netbsd:tcp_input+0xfd5
ipintr() at netbsd:ipintr+0x8f1
softint_dispatch() at netbsd:softint_dispatch+0x11c

db{0}> mach cpu 2
using CPU 2
db{0}> tr
_kernel_lock() at netbsd:_kernel_lock+0xd5
mb_drain() at netbsd:mb_drain+0x17
pool_grow() at netbsd:pool_grow+0x3b9
pool_get() at netbsd:pool_get+0x3c7
pool_cache_get_slow() at netbsd:pool_cache_get_slow+0x139
pool_cache_get_paddr() at netbsd:pool_cache_get_paddr+0x233
m_get() at netbsd:m_get+0x37
m_gethdr() at netbsd:m_gethdr+0x9
sosend() at netbsd:sosend+0x3d4
soo_write() at netbsd:soo_write+0x2f
dofilewrite() at netbsd:dofilewrite+0x80
sys_write() at netbsd:sys_write+0x49
syscall() at netbsd:syscall+0x196

CPU 0 holds the kernel lock and tries to get the pool lock.
CPU 2 holds the pool lock and tries to get the kernel lock.
As CPU 2 is spinning on the kernel lock, CPU 0's attempt at the pool's
mutex keeps spinning too instead of sleeping (sleeping would release
the kernel lock and break the cycle).

I also got a similar hang involving m_clget() instead of m_get().

>How-To-Repeat:
	I've seen this only on this server, and under heavy load (more than
	1500 active TCP connections, 2 gigabit interfaces at full speed).
>Fix:
	The patch below seems to fix the issue (I've been running it for
	3 days now, without hangs). It doesn't seem to hurt to not call the
	drain hook in the !PR_WAITOK case; it will be called asynchronously
	on RAM shortage anyway.
	Alternatively we could return early in mb_drain() in the !PR_WAITOK
	case, and audit other pool drain usages.
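The alternative mentioned above could look roughly like this. This is a sketch only, assuming the pool(9) drain-hook signature void (*fn)(void *arg, int flags) and that mb_drain() currently takes the kernel lock unconditionally; the reclaim body is elided:

```c
static void
mb_drain(void *arg, int flags)
{
	/*
	 * Bail out when the caller cannot sleep: taking kernel_lock
	 * from a nowait pool allocation is what deadlocks above.
	 */
	if ((flags & PR_WAITOK) == 0)
		return;

	KERNEL_LOCK(1, NULL);
	/* ... existing reclaim logic ... */
	KERNEL_UNLOCK_ONE(NULL);
}
```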

Index: kern/subr_pool.c
===================================================================
RCS file: /cvsroot/src/sys/kern/subr_pool.c,v
retrieving revision 1.285.4.2
diff -u -p -u -r1.285.4.2 subr_pool.c
--- kern/subr_pool.c	15 Dec 2024 14:58:45 -0000	1.285.4.2
+++ kern/subr_pool.c	9 May 2025 14:47:30 -0000
@@ -2967,6 +2957,7 @@ pool_allocator_alloc(struct pool *pp, in
 	void *res;
 
 	res = (*pa->pa_alloc)(pp, flags);
+#if 0
 	if (res == NULL && (flags & PR_WAITOK) == 0) {
 		/*
 		 * We only run the drain hook here if PR_NOWAIT.
@@ -2978,6 +2969,7 @@ pool_allocator_alloc(struct pool *pp, in
 			res = (*pa->pa_alloc)(pp, flags);
 		}
 	}
+#endif
 	return res;
 }
 


