NetBSD-Bugs archive


Re: kern/59411 (deadlock on mbuf pool)



On Fri, May 16, 2025 at 02:29:22AM +0000, riastradh%NetBSD.org@localhost wrote:
> [...]
> Until subr_pool.c rev. 1.220 in 2017, pool_grow dropped the lock around
> pool_allocator_alloc and thus avoided this deadlock, but it's unclear
> why that change was made:
> 
> 
> Module Name:    src
> Committed By:   christos
> Date:           Fri Dec 29 16:13:26 UTC 2017
> 
> Modified Files:
>         src/sys/kern: subr_pool.c
> 
> Log Message:
> Don't release the lock in the PR_NOWAIT allocation. Move flags setting
> after acquiring the mutex. (from Tobias Nygren)
> ...
> @@ -1088,10 +1088,11 @@ pool_grow(struct pool *pp, int flags)
>  		}
>  	}
>  	pp->pr_flags |= PR_GROWING;
> -	if ((flags & PR_WAITOK) == 0)
> +	if (flags & PR_WAITOK)
> +		mutex_exit(&pp->pr_lock);
> +	else
>  		pp->pr_flags |= PR_GROWINGNOWAIT;
>  
> -	mutex_exit(&pp->pr_lock);
>  	char *cp = pool_allocator_alloc(pp, flags);
>  	if (__predict_false(cp == NULL))
>  		goto out;
> 
> https://mail-index.NetBSD.org/source-changes/2017/12/29/msg090787.html
> 
> 
> Can you try reverting that change and see what happens?
> 
> Unless we have an explanation of the motivation for that change, I
> think it is wrong, precisely because of the deadlock here: the pool
> lock MUST be released around the drain hook.
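
To restate the deadlock as I understand it (illustrative call chain,
not an exact stack trace):

	pool_get(pp, PR_NOWAIT)                /* takes pp->pr_lock */
	 -> pool_grow(pp, PR_NOWAIT)           /* pr_lock kept held since rev 1.220 */
	     -> pool_allocator_alloc(pp, flags)
	         -> (*pa->pa_alloc)(pp, flags)  /* backing allocator fails */
	         -> (*pp->pr_drain_hook)(...)   /* e.g. mbuf reclaim */
	             -> ... frees items back into the same pool
	                 -> mutex_enter(&pp->pr_lock)  /* already held: deadlock */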

Testing this patch now:

Index: subr_pool.c
===================================================================
RCS file: /cvsroot/src/sys/kern/subr_pool.c,v
retrieving revision 1.285.4.2
diff -u -p -u -r1.285.4.2 subr_pool.c
--- subr_pool.c	15 Dec 2024 14:58:45 -0000	1.285.4.2
+++ subr_pool.c	16 May 2025 09:46:15 -0000
@@ -1411,11 +1411,10 @@ pool_grow(struct pool *pp, int flags)
 		}
 	}
 	pp->pr_flags |= PR_GROWING;
-	if (flags & PR_WAITOK)
-		mutex_exit(&pp->pr_lock);
-	else
+	if ((flags & PR_WAITOK) == 0)
 		pp->pr_flags |= PR_GROWINGNOWAIT;
 
+	mutex_exit(&pp->pr_lock);
 	storage = pool_allocator_alloc(pp, flags);
 	if (__predict_false(storage == NULL))
 		goto out;
@@ -1426,8 +1425,7 @@ pool_grow(struct pool *pp, int flags)
 		goto out;
 	}
 
-	if (flags & PR_WAITOK)
-		mutex_enter(&pp->pr_lock);
+	mutex_enter(&pp->pr_lock);
 	pool_prime_page(pp, storage, ph);
 	pp->pr_npagealloc++;
 	KASSERT(pp->pr_flags & PR_GROWING);
@@ -1439,8 +1437,7 @@ pool_grow(struct pool *pp, int flags)
 	cv_broadcast(&pp->pr_cv);
 	return 0;
 out:
-	if (flags & PR_WAITOK)
-		mutex_enter(&pp->pr_lock);
+	mutex_enter(&pp->pr_lock);
 	KASSERT(pp->pr_flags & PR_GROWING);
 	pp->pr_flags &= ~(PR_GROWING|PR_GROWINGNOWAIT);
 	return ENOMEM;


I will reboot the affected server this evening; we'll need to wait at
least until Monday evening to be sure the issue isn't back.

> 
> The surrounding PR_GROWINGNOWAIT logic -- which you recently
> half-deleted, though I have asked to restore it until we have
> adequately analyzed the situation and history
> (https://mail-index.NetBSD.org/tech-kern/2025/05/16/msg030467.html) --
> had been introduced two weeks prior by mrg@, and apparently reviewed by
> me (although I have no memory of it), but I can't find any record of
> the symptoms this change was supposed to fix:

With the above patch, PR_GROWINGNOWAIT isn't a no-op any more: because
the lock is now dropped around the allocation, another PR_NOWAIT caller
can actually observe the flag while a grow is in flight.
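
For reference, here is roughly the logic at the top of pool_grow() that
consumes these flags (paraphrased from my reading of subr_pool.c; exact
details may differ):

	/* entered with pp->pr_lock held */
	if (pp->pr_flags & PR_GROWING) {
		if (flags & PR_WAITOK) {
			/* sleep until the current grower finishes, then retry */
			do {
				cv_wait(&pp->pr_cv, &pp->pr_lock);
			} while (pp->pr_flags & PR_GROWING);
			return ERESTART;
		}
		if (pp->pr_flags & PR_GROWINGNOWAIT) {
			/*
			 * A PR_NOWAIT grow is in flight; briefly drop and
			 * retake the lock so it can make progress, then
			 * retry.  This branch is reachable only if that
			 * grower actually releases pr_lock around the
			 * allocation -- which is what the patch restores.
			 */
			mutex_exit(&pp->pr_lock);
			mutex_enter(&pp->pr_lock);
			return ERESTART;
		}
		return EWOULDBLOCK;
	}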

-- 
Manuel Bouyer, LIP6, Sorbonne Université.           Manuel.Bouyer%lip6.fr@localhost
     NetBSD: 26 years of experience will always make the difference