Subject: Re: kern/37437: signal problems in linux threads
To: None <kern-bug-people@netbsd.org, gnats-admin@netbsd.org,>
From: Andrew Doran <ad@netbsd.org>
List: netbsd-bugs
Date: 11/27/2007 02:20:02
The following reply was made to PR kern/37437; it has been noted by GNATS.

From: Andrew Doran <ad@netbsd.org>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/37437: signal problems in linux threads
Date: Tue, 27 Nov 2007 02:15:11 +0000

 On Mon, Nov 26, 2007 at 08:45:02PM +0000, Andrew Doran wrote:
 
 >  I agree. While we are at it, it would be worthwhile checking our current
 >  behaviour in the presence of LWPs against POSIX. I did look before when
 >  doing the per-LWP signal stuff but it has been nearly a year and I've
 >  forgotten.
 
 I checked the behaviour against the standards wrt threads and it's fine,
 this stuff is all process global. I also tidied up the sigacts locking a bit
 (has been committed), although it doesn't affect the problem being described
 here.
 
 I did find one problem with sharing sigignore/sigcatch between processes.
 Each process has a lock that covers its signal state along with a bunch of
 other things: p_smutex. Within a single process all LWPs share that lock, so
 there is a consistent view of the signal state across routines where the
 lock is held like kpsignal2().
 
 If sigignore/sigcatch are shared between processes then we have a problem
 because the per-process lock is no longer enough and the view of signal
 state is no longer consistent. There is an out, though: in nearly all the
 places those fields are inspected, we hold proclist_mutex which is a global,
 so we can say sigignore/sigcatch and the signal actions are locked by that.
 
 The two exceptions I see are sendsig_reset() and sigaction1(): neither of
 those routines holds the lock. In sigaction1(), proclist_mutex can be
 acquired without problem just before p_smutex is acquired. sendsig_reset()
 is a bit more tricky because the lock hierarchy is proclist_mutex ->
 p_smutex, and it is called with p_smutex already held.
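 
 A minimal userland sketch of the lock order described above, using pthread
 mutexes as stand-ins for the kernel mutexes. The struct layouts, the
 bitmask representation and every name ending in _sketch are assumptions
 made for illustration, not the kernel's real definitions. It shows a
 sigaction1()-style update taking the global lock first and the
 per-process lock second, so that a second process sharing the same
 sigignore/sigcatch sees a consistent view under the global lock.

```c
#include <assert.h>
#include <pthread.h>

/* Userland stand-in for the global lock; not the kernel's real mutex. */
static pthread_mutex_t proclist_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical signal state that may be shared between processes. */
struct sigstate_sketch {
	unsigned long sigignore;
	unsigned long sigcatch;
};

/* Hypothetical process: a private lock plus a pointer to shared state. */
struct proc_sketch {
	pthread_mutex_t p_smutex;
	struct sigstate_sketch *ss;
};

static unsigned long
sigbit(int signo)
{
	assert(signo >= 1 && signo <= 31);
	return 1UL << (signo - 1);
}

/* Update path: global lock first, then the per-process lock. */
static void
sketch_set_ignored(struct proc_sketch *p, int signo)
{
	pthread_mutex_lock(&proclist_mutex);
	pthread_mutex_lock(&p->p_smutex);
	p->ss->sigignore |= sigbit(signo);
	p->ss->sigcatch &= ~sigbit(signo);
	pthread_mutex_unlock(&p->p_smutex);
	pthread_mutex_unlock(&proclist_mutex);
}

/* Reader in another process: the global lock covers the shared fields. */
static int
sketch_is_ignored(struct proc_sketch *p, int signo)
{
	int ignored;

	pthread_mutex_lock(&proclist_mutex);
	ignored = (p->ss->sigignore & sigbit(signo)) != 0;
	pthread_mutex_unlock(&proclist_mutex);
	return ignored;
}

/* Two processes sharing one state: b observes what a set. */
static int
sketch_shared_demo(void)
{
	struct sigstate_sketch shared = { 0, 0 };
	struct proc_sketch a, b;

	pthread_mutex_init(&a.p_smutex, NULL);
	pthread_mutex_init(&b.p_smutex, NULL);
	a.ss = &shared;
	b.ss = &shared;
	sketch_set_ignored(&a, 2);
	return sketch_is_ignored(&b, 2);
}
```

 A caller that already holds p_smutex cannot use sketch_set_ignored()
 as written, since taking proclist_mutex at that point would invert the
 order; that is the sendsig_reset() difficulty.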
 
 spanners$ grep sendsig_reset `find src/sys/arch -name "*.c"` | wc -l
       40
 spanners$ grep sendsig_reset `find src/sys/compat -name "*.c"` | wc -l
       13
 
 There are 53 call sites so the major issue is trying to find a way to weasel
 the lock acquisition in without changing all of them :-). At this point my
 e-mail is about to get even more long winded and rambling. Here's another
 tremendous blast of hot air to add to the above:
 
 Each LWP has its own private signal state that is never accessed by other
 LWPs, like the signal mask and whether or not it's on the alt signal stack.
 The locking scheme was designed with SA (scheduler activations) in mind,
 shortly before it was removed from -current. In the SA universe, the signal
 mask is process global. So calls to the signal delivery routines (usually
 called sendsig_foo or foo_sendsig) are made with p_smutex held so we have
 atomic access to the global signal mask, etc. Obviously that is no longer
 needed now that SA is gone. We could:
 
 o Remove the back-calls to sendsig_reset() from the delivery routines.
 o Don't cover the delivery routines with p_smutex.
 o Have kpsendsig() do sendsig_reset()'s job before we call down to the
   delivery routine, and unlock before we call down.
 o Additionally, have kpsendsig() extract sa_handler while under lock
   (i.e. atomically wrt the other signal state) and have it pass this
   in as an argument to the delivery routine.
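 
 The four points above can be sketched in userland C as follows. This is
 an illustration under stated assumptions, not the kernel code: the types
 and every name ending in _sketch are hypothetical, a pthread mutex stands
 in for p_smutex, and a one-shot handler reset (loosely modelled on
 SA_RESETHAND) stands in for whatever sendsig_reset() does.

```c
#include <assert.h>
#include <pthread.h>
#include <signal.h>

typedef void (*handler_sketch_t)(int);

/* Hypothetical per-process signal state, covered by p_smutex. */
struct proc_sketch {
	pthread_mutex_t p_smutex;
	handler_sketch_t handlers[32];
	unsigned long sigcatch;
	unsigned long sigreset;		/* signals whose handler is one-shot */
};

static int deliveries;			/* counts calls to the test handler */

static void
counting_handler(int signo)
{
	(void)signo;
	deliveries++;
}

/* Delivery routine: runs unlocked, handler arrives as an argument. */
static void
sendsig_sketch(int signo, handler_sketch_t handler)
{
	handler(signo);
}

/* Caller does the reset work and the handler lookup under the lock. */
static void
kpsendsig_sketch(struct proc_sketch *p, int signo)
{
	handler_sketch_t handler;
	unsigned long bit;

	assert(signo >= 1 && signo <= 31);
	bit = 1UL << (signo - 1);

	pthread_mutex_lock(&p->p_smutex);
	handler = p->handlers[signo];	/* read atomically wrt other state */
	if (p->sigreset & bit) {	/* stand-in for sendsig_reset() */
		p->handlers[signo] = SIG_DFL;
		p->sigcatch &= ~bit;
		p->sigreset &= ~bit;
	}
	pthread_mutex_unlock(&p->p_smutex);	/* unlock before calling down */

	if (handler != SIG_DFL && handler != SIG_IGN)
		sendsig_sketch(signo, handler);
}

/* Post the same one-shot signal twice; only the first is delivered. */
static int
sketch_delivery_demo(void)
{
	struct proc_sketch p;
	int i;

	pthread_mutex_init(&p.p_smutex, NULL);
	for (i = 0; i < 32; i++)
		p.handlers[i] = SIG_DFL;
	p.handlers[2] = counting_handler;
	p.sigcatch = 1UL << 1;
	p.sigreset = 1UL << 1;

	deliveries = 0;
	kpsendsig_sketch(&p, 2);
	kpsendsig_sketch(&p, 2);
	return deliveries;
}
```

 Because the delivery routine never touches the shared state in this
 arrangement, it has no need to call back for the reset, and the existing
 call sites would not need to learn about a second lock.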
 
 Andrew