Subject: Fwd: libpthread killed my dog, part N+1
To: None <tech-kern@netbsd.org>
From: Charles M. Hannum <abuse@spamalicious.com>
List: tech-kern
Date: 01/06/2005 01:52:50
--Boundary-00=_ynJ3Bkr8RHNtwuJ
Content-Type: text/plain;
  charset="us-ascii"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline

Since the previous discussion(?) was here, I'm forwarding the latest bit, 
though it seems not to be a kernel issue.

--Boundary-00=_ynJ3Bkr8RHNtwuJ
Content-Type: message/rfc822;
  name="forwarded message"
Content-Transfer-Encoding: 7bit
Content-Description: "Charles M. Hannum" <abuse@spamalicious.com>: libpthread killed my dog, part N+1
Content-Disposition: inline

From: "Charles M. Hannum" <abuse@spamalicious.com>
Organization: By Noon Software, Inc.
To: tech-userlevel@NetBSD.org
Subject: libpthread killed my dog, part N+1
Date: Thu, 6 Jan 2005 01:35:23 +0000
User-Agent: KMail/1.7
MIME-Version: 1.0
Content-Type: text/plain;
  charset="us-ascii"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
Message-Id: <200501060135.23701.abuse@spamalicious.com>
Sender: tech-userlevel-owner@NetBSD.org

I have discovered another deadlock, *and* the reason for upcall exhaustion.

Let us review.  When we receive a SA_UPCALL_UNBLOCKED for a thread holding a 
spinlock, we cause an immediate switch to that thread from 
pthread__resolve_locks(), presumably on the theory that it will finish and 
unlock immediately.  Note that at this point, pt_blockgen==pt_unblockgen+1; 
pt_unblockgen gets incremented again after pthread__resolve_locks() returns 
and we call pthread__sched_bulk().

However, it may happen that the thread blocks again.  When this happens, we 
now have a chain of upcall thread(s) implicitly blocked waiting for it.  In 
addition, pt_blockgen==pt_unblockgen+3.

Eventually we will get another SA_UPCALL_UNBLOCKED.  When this happens, if we 
are lucky, the thread will finish with the lock, and the hack in 
pthread_spinunlock() will switch back to the upcall thread immediately.  At 
this point, pt_blockgen==pt_unblockgen+2 (because we received two unblocks).

At this point, the upcall chain will terminate, pthread__sched_bulk() will be 
called, and because pt_unblockgen is already even, it will not be 
incremented!  Note that we are screwed now; various pieces of code will 
evermore think that the thread is blocked.  This leads to one form of 
deadlock (signal delivery will never succeed, and the thread can get stuck 
repeatedly taking a trap).

Even if I fix the even-odd test in pthread__sched_bulk(), this problem can 
still lead to upcall exhaustion, by causing a chain of upcalls to be stuck.  
I think -- but I'm not sure yet -- that they actually spin on the CPU, 
waiting for the unblock that will allow them to continue.

Somehow, in all this mess, pthread__concurrency also becomes -1.  I'm not sure 
exactly how that happens.


This really needs to be fixed, somehow.

--Boundary-00=_ynJ3Bkr8RHNtwuJ--