Subject: Re: yet another SA assertion failure with 3.0
To: None <tech-kern@netbsd.org>
From: Chuck Silvers <chuq@chuq.com>
List: tech-kern
Date: 11/07/2005 20:17:03
--jRHKVT23PllUwdXP
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

On Tue, Oct 25, 2005 at 11:37:20PM -0700, Chuck Silvers wrote:
> while looking into the problem with freeing the sau twice,
> I hit another assertion fairly repeatably:
> 
> panic: kernel debugging assertion "vp->savp_lwp == l2" failed: file "../../../../kern/kern_sa.c", line 1021
> 
> the way I trigger this is to start named, run another program that
> allocates enough memory that named is completely paged out, then send
> named a SIGKILL.  running this in a loop for a while has triggered the
> assertion several times.  does anyone else want to look at this one?

the problem is in this block of code in kpsignal2():

                if (l == NULL) {
                        /*
                         * Special case: SIGKILL of a process
                         * which is entirely composed of
                         * suspended LWPs should succeed. We
                         * make this happen by unsuspending one of
                         * them.
                         */
                        if (allsusp && (signum == SIGKILL)) {
                                if (p->p_flag & P_SA) {
                                        /*
                                         * get a suspended lwp from
                                         * the cache to send KILL
                                         * signal
                                         * XXXcl add signal checks at resume points
                                         */
                                        suspended = sa_getcachelwp
                                                (SLIST_FIRST(&p->p_sa->sa_vps));
                                }
                                lwp_continue(suspended);
                        }
                        goto done;
                }


the problem that we've been seeing most recently (PR 28886) is that the
LWP that we make runable here won't own the VP, so that LWP will hit
the assertion about that in sa_switchcall().

the other problem (PR 26771) is that the VP may not even have any cached
LWPs available, so we'll crash in lwp_continue().

I tried adding some code to make sure that an LWP is available and that
that LWP owns the VP, but that led to more problems.  that code is attached
in case anyone is curious.

before I spend a bunch of time fiddling with this,
does anyone have any suggestions on how to handle this case?

-Chuck

--jRHKVT23PllUwdXP
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename="diff.sa-sigkill"

Index: src/sys/kern/kern_sig.c
===================================================================
RCS file: /cvsroot/src/sys/kern/kern_sig.c,v
retrieving revision 1.210
diff -u -p -r1.210 kern_sig.c
--- src/sys/kern/kern_sig.c	23 Oct 2005 00:09:14 -0000	1.210
+++ src/sys/kern/kern_sig.c	8 Nov 2005 03:35:31 -0000
@@ -1037,6 +1037,7 @@ kpsignal2(struct proc *p, const ksiginfo
 	else
 		SCHED_ASSERT_LOCKED();
 #endif
+	KASSERT(dolock || signum != SIGKILL || p == curproc);
 
 	/*
 	 * Notify any interested parties in the signal.
@@ -1142,6 +1143,20 @@ kpsignal2(struct proc *p, const ksiginfo
 		ksiginfo_put(p, ksi);
 		return;
 	}
+
+	/*
+	 * Make sure that we have an available LWP to process SIGKILL
+	 * for SA processes.
+	 */
+	if (signum == SIGKILL && p->p_flag & P_SA && p != curproc) {
+needlwp:
+		vp = SLIST_FIRST(&p->p_sa->sa_vps);
+		if (vp->savp_ncached == 0) {
+			sa_newcachelwp(LIST_FIRST(&p->p_lwps));
+		}
+	}
+
+
 	/* XXXSMP: works, but icky */
 	if (dolock)
 		SCHED_LOCK(s);
@@ -1182,7 +1197,8 @@ kpsignal2(struct proc *p, const ksiginfo
 		} else if (p->p_stat == SSTOP) {
 			SLIST_FOREACH(vp, &p->p_sa->sa_vps, savp_next) {
 				l = vp->savp_lwp;
-				if (l->l_stat == LSSLEEP && (l->l_flag & L_SINTR) != 0)
+				if (l->l_stat == LSSLEEP &&
+				    (l->l_flag & L_SINTR) != 0)
 					break;
 				l = NULL;
 			}
@@ -1260,28 +1276,28 @@ kpsignal2(struct proc *p, const ksiginfo
 		}
 
 		if (l == NULL) {
+
 			/*
-			 * Special case: SIGKILL of a process
-			 * which is entirely composed of
-			 * suspended LWPs should succeed. We
-			 * make this happen by unsuspending one of
-			 * them.
+			 * Special case: SIGKILL of a process which is entirely
+			 * composed of suspended LWPs should succeed. We make
+			 * this happen by unsuspending one of them.
 			 */
-			if (allsusp && (signum == SIGKILL)) {
+
+			if (allsusp && signum == SIGKILL) {
 				if (p->p_flag & P_SA) {
-					/*
-					 * get a suspended lwp from
-					 * the cache to send KILL
-					 * signal
-					 * XXXcl add signal checks at resume points
-					 */
-					suspended = sa_getcachelwp
-						(SLIST_FIRST(&p->p_sa->sa_vps));
+					vp = SLIST_FIRST(&p->p_sa->sa_vps);
+					suspended = sa_getcachelwp(vp);
+					if (suspended == NULL) {
+						SCHED_UNLOCK(s);
+						goto needlwp;
+					}
+					vp->savp_lwp = suspended;
 				}
 				lwp_continue(suspended);
 			}
 			goto done;
 		}
+
 		/*
 		 * All other (caught or default) signals
 		 * cause the process to run.

--jRHKVT23PllUwdXP--