Subject: Fwd: libpthread killed my dog, part N+1
To: None <firstname.lastname@example.org>
From: Charles M. Hannum <email@example.com>
Date: 01/06/2005 01:52:50
Since the previous discussion(?) was here, I'm forwarding the latest bit,
though it seems not to be a kernel issue.
Content-Description: "Charles M. Hannum" <firstname.lastname@example.org>: libpthread killed my dog, part N+1
From: "Charles M. Hannum" <email@example.com>
Organization: By Noon Software, Inc.
Subject: libpthread killed my dog, part N+1
Date: Thu, 6 Jan 2005 01:35:23 +0000
I have discovered another deadlock, *and* the reason for upcall exhaustion.
Let us review. When we receive a SA_UPCALL_UNBLOCKED for a thread holding a
spinlock, we cause an immediate switch to that thread from
pthread__resolve_locks(), presumably on the theory that it will finish and
unlock immediately. Note that at this point, pt_blockgen==pt_unblockgen+1;
pt_unblockgen gets incremented again after pthread__resolve_locks() returns
and we call pthread__sched_bulk().
However, it may happen that the thread blocks again. When this happens, we
now have a chain of upcall thread(s) implicitly blocked waiting for it.
Eventually we will get another SA_UPCALL_UNBLOCKED. When this happens, if we
are lucky, the thread will finish with the lock, and the hack in
pthread_spinunlock() will switch back to the upcall thread immediately. At
this point, pt_blockgen==pt_unblockgen+2 (because we received two unblocks).
At this point, the upcall chain will terminate, pthread__sched_bulk() will be
called, and because pt_unblockgen is already even, it will not be
incremented! Note that we are screwed now; various pieces of code will
evermore think that the thread is blocked. This leads to one form of
deadlock (signal delivery will never succeed, and the thread can get stuck
repeatedly taking a trap).
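To make the counter arithmetic above concrete, here is a toy model in C. It is only a sketch, not the libpthread source. The increments are assumptions inferred from the numbers quoted above: a block advances the block counter by 2, each SA_UPCALL_UNBLOCKED advances the unblock counter by 1, and the sched_bulk step increments the unblock counter only when it is odd. The rule that a thread "looks blocked" whenever the two counters differ is also an assumption. All toy_* names are hypothetical.

```c
#include <stdbool.h>

/* Toy stand-ins for pt_blockgen and pt_unblockgen; hypothetical model only. */
struct toy_thread {
    unsigned blockgen;
    unsigned unblockgen;
};

/* Assumption: blocking advances the block generation by 2. */
static void toy_block(struct toy_thread *t)
{
    t->blockgen += 2;
}

/* Assumption: receiving SA_UPCALL_UNBLOCKED advances the unblock generation by 1. */
static void toy_unblocked_upcall(struct toy_thread *t)
{
    t->unblockgen += 1;
}

/* Assumption: the even-odd test; increment only if the counter is odd. */
static void toy_sched_bulk(struct toy_thread *t)
{
    if (t->unblockgen & 1)
        t->unblockgen += 1;
}

/* Assumption: other code treats mismatched generations as "still blocked". */
static bool toy_looks_blocked(const struct toy_thread *t)
{
    return t->blockgen != t->unblockgen;
}

/* Ordinary case: block, unblock upcall, then sched_bulk. */
static struct toy_thread toy_normal_case(void)
{
    struct toy_thread t = { 0, 0 };
    toy_block(&t);              /* blockgen 2, unblockgen 0 */
    toy_unblocked_upcall(&t);   /* blockgen == unblockgen + 1 */
    toy_sched_bulk(&t);         /* odd, so incremented: 2 and 2 */
    return t;
}

/* Failing case: the thread blocks again before sched_bulk runs. */
static struct toy_thread toy_reblock_case(void)
{
    struct toy_thread t = { 0, 0 };
    toy_block(&t);              /* blockgen 2, unblockgen 0 */
    toy_unblocked_upcall(&t);   /* blockgen == unblockgen + 1 */
    toy_block(&t);              /* blocks again while holding the lock */
    toy_unblocked_upcall(&t);   /* blockgen == unblockgen + 2 */
    toy_sched_bulk(&t);         /* already even, so NOT incremented */
    return t;
}
```

Under these assumptions, toy_normal_case() ends with the two counters equal, while toy_reblock_case() ends with them two apart and nothing left to close the gap, so toy_looks_blocked() stays true from then on.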
Even if I fix the even-odd test in pthread__sched_bulk(), this problem can
still lead to upcall exhaustion, by causing a chain of upcalls to be stuck.
I think -- but I'm not sure yet -- that they actually spin on the CPU,
waiting for the unblock that will allow them to continue.
Somehow, in all this mess, pthread__concurrency also becomes -1. I'm not sure
exactly how that happens.
This really needs to be fixed, somehow.