Subject: Re: The Right Way to implement threads?
To: Nathan J. Williams <nathanw@MIT.EDU>
From: Matthew Orgass <darkstar@pgh.net>
List: tech-kern
Date: 06/10/2000 01:58:40
On 9 Jun 2000, Nathan J. Williams wrote:

  Thanks for your comments!

> I've been doing a lot of reading on this topic (I'm implementing
> scheduler activations in NetBSD for my Master of Engineering thesis),
> and I have some comments.

  Great!  It would be interesting to compare working versions of the two.

> OK, so you have a system where there is some per-thread kernel entity
> that the kernel can schedule. Sounds like a LWP system so far. But
> "many-to-many" models are characterized by one level (typically the
> user level thread library) controlling which of its entities run on
> the kernel's schedulable entities. The details of this control
> mechanism are what make the many-to-many models interesting.

  I was mainly thinking of many-to-many as having a larger number of user
threads than kernel threads.  It wasn't a very good comparison.

> Maybe you just mean that time is divided among processes, and then
> subdivided among threads, rather than having all threads competing
> against each other (what POSIX calls "process" vs. "system" contention
> scopes).

  Actually, I haven't worked out the details of the scheduler yet and it
would certainly provide many opportunities for research in any case.  It
would at least take thread priority into account.

> However, thread scheduling by the kernel is considered a lose for a
> couple of reasons. First, a slightly academic point: it forces all
> applications into one scheduling model. Sometimes different apps can
> tweak the parameters, but they're fundamentally in the same regime.

  The kernel scheduler could implement multiple models if desired.  While
the program itself couldn't implement an arbitrary model without also
implementing a Solaris-type thread library, I doubt that many want to or
should.

> Second and more importantly, switching from one thread to another now
> requires a system call, and the associated expense of changing
> protection domains from user space to the kernel and back. Two-level
> scheduling systems go out of their way to avoid this and switch
> context at the user level when possible (which is often); the original
> scheduler activations paper and the performance analysis papers that
> lead up to it show pretty clearly that making system calls to change
> thread context is much more expensive than necessary.

  I disagree with this analysis.  As the Linux folks are fond of pointing
out, the cost of a kernel context switch (which can be made quite low) is
not the only consideration:  cache misses cost quite a bit too and will
happen in either case.  Furthermore, many thread switches occur when
blocking for system resources or at the end of a time slice, in which case
you have already entered the kernel.  In these cases, doing the context
switch in the kernel will be faster.  In addition, the upcalls needed for
scheduler activations take a little time too.  I suspect that the net
effect will be that most programs will be slightly faster with my method.
Since the kernel manages all threads, thread priorities become global,
decreasing the time that a lower-priority thread executes while a
higher-priority thread is ready to run.  This makes it appear to be even
faster than it is.

> Okay, what is a "kernel thread" that is different from the
> kernel-scheduled entity mentioned previously? The kernel stack, I
> suppose. But you don't just need that for system calls; you also need
> it for handling other traps, such as page faults or FPE traps
> (interrupts are currently handled on a process's kernel stack, but
> don't have to be).

  Each CPU also gets a stack to be used as the initial kernel stack for
the running process.  If the task is brief, it deals with it on that
stack; otherwise it gets shoved out to a kernel thread.
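
  Roughly, a standalone sketch of that entry path (not actual NetBSD
code; every name below is made up for illustration):

#include <stdbool.h>

struct trapframe;                     /* opaque machine trap state */

struct cpu_info {
    void *ci_entry_stack;             /* per-CPU stack used on kernel entry */
};

/* Stubs standing in for the real machinery. */
static bool trap_is_brief(struct trapframe *tf) { (void)tf; return true; }
static void handle_brief_trap(struct trapframe *tf) { (void)tf; }
static void handoff_to_kernel_thread(struct trapframe *tf) { (void)tf; }

/*
 * Runs on the per-CPU entry stack.  Brief work is finished in place;
 * anything longer is pushed out to a kernel thread with its own kernel
 * stack, freeing the per-CPU stack for the next entry.
 */
static void
trap_entry(struct trapframe *tf)
{
    if (trap_is_brief(tf)) {
        handle_brief_trap(tf);
        return;
    }
    handoff_to_kernel_thread(tf);
}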

> Preempting the kernel work being done on behalf of the other thread?
> This is a big, big, big deal. It requires reworking a large fraction
> of the kernel in complicated ways to permit correct operation in the
> face of kernel preemption. It's on par with reimplementing the whole
> thing.  Requiring this makes your suggestion less feasible by orders
> of magnitude.

  The syscall is canceled, just as happens today with signals.  It's not a
requirement, just something that can be done if needed.  There would be a
system default and a per-thread control.
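
  Just to illustrate the shape of that control (the names below are made
up, not part of any existing API), it could be a per-thread policy that
falls back to the system default:

/* Hypothetical per-thread setting for what happens on preemption. */
enum syscall_preempt_policy {
    SPP_DEFAULT,     /* follow the system-wide default */
    SPP_CANCEL,      /* cancel the in-progress syscall, as with a signal */
    SPP_COMPLETE     /* let the in-progress syscall run to completion */
};

struct uthread {
    enum syscall_preempt_policy ut_spp;   /* per-thread control */
    /* ... other per-thread state ... */
};

/* Decide what to do with a thread's in-kernel work when it is preempted. */
static enum syscall_preempt_policy
preempt_policy(const struct uthread *ut, enum syscall_preempt_policy sysdef)
{
    return (ut->ut_spp == SPP_DEFAULT) ? sysdef : ut->ut_spp;
}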

[Actually, I think Jason Thorpe recently said that there is not much
stopping kernel threads from being preemptable now.  But this is not
required for my method.]

> Besides that, I'm not sure it makes sense. The resources being
> consumed by the kernel thread are its kernel stack, which isn't
> something that can be thrown away even if it is preempted.

  Why not?

> > If a user thread in need was not able to procure a kernel thread, it
> > would get in line for the next one available or perhaps take various
> > measures against other lower priority threads in its process.  Whenever
> > reasonable (I'm not sure how often this is, if ever), a user thread
> > waiting for a kernel resource but not actually needing a kernel thread
> > while waiting would not be associated with a kernel thread (unless it was
> > reserved.  It might also be guaranteed one when it needs it).  As with the
> > Solaris model, a signal could be sent to a process when all of its kernel
> > threads are in use, allowing it to increase the number (subject to
> > applicable limit) if desired. 
> 
> This seems like it is the heart of how your scheduling works, but it's
> very unclear. 

  When a user thread is created, it is set to use the shared per-process
kernel thread pool.  It can request that a kernel thread be reserved for
its use only; such a pair would then act like a standard LWP.  It could
also have one guaranteed (either for its own use or against a per-process
total that has not been reached), which means that if the system runs out
of available kernel threads it will steal one that has been overcommitted.
If neither of these is the case and the system has reached its limit when
the thread needs one, it will be put on a wait queue until one becomes
available.

  I am working on some additions to this scheme, but this is the basic
idea.
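
  As a rough standalone sketch of the acquisition logic (all names and
structures below are hypothetical, and the per-process guarantee total is
reduced to a per-thread flag for brevity):

#include <stdbool.h>
#include <stddef.h>

struct kthread;                        /* kernel-schedulable entity */

struct uthread {
    struct kthread *ut_reserved;       /* reserved for this thread only,
                                          or NULL */
    bool            ut_guaranteed;     /* guaranteed one on demand */
};

/* Stubs standing in for the real pool machinery. */
static struct kthread *pool_get_free(void) { return NULL; }
static struct kthread *pool_steal_overcommitted(void) { return NULL; }
static struct kthread *wait_for_kthread(struct uthread *ut)
{ (void)ut; return NULL; }

/*
 * Find a kernel thread for a user thread that needs one.  A reserved
 * pair behaves like an ordinary LWP; a guaranteed thread may steal an
 * overcommitted kernel thread when the pool is exhausted; everything
 * else waits its turn on the queue.
 */
static struct kthread *
acquire_kthread(struct uthread *ut)
{
    struct kthread *kt;

    if (ut->ut_reserved != NULL)
        return ut->ut_reserved;
    if ((kt = pool_get_free()) != NULL)
        return kt;
    if (ut->ut_guaranteed && (kt = pool_steal_overcommitted()) != NULL)
        return kt;
    return wait_for_kthread(ut);       /* block until one is handed over */
}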

> >   A process always has just one signal handler table, but may have
> > multiple address spaces and threads.
> 
> A single address space has historically been the key resource that
> joined things together as a process. What do you mean, then, when you
> say that a process may have multiple address spaces? Why is it useful
> to consider entities with different address spaces as one process?

  It is mainly semantics: a process is what dies when you type kill <pid>. 
From a userland perspective, the only thing you can do with a PID is
signal it.  Therefore it is the response to this signal that should define
what a process is.

> How can there only be one signal handler table if there are separate
> address spaces? The handler functions won't exist in the different
> spaces.

  Because you had to fork address spaces to get there, the handler
functions are either shared or copied.

> It would be worthwhile for you to give a comparison of this system
> with other concurrency-providing interfaces (pure-userlevel threads,
> one-to-one kernel threads, scheduler activations, asynchronous
> threads, demand-allocated LWP's [Solaris]).

  The main benefit of my method is global thread priority without the
overhead of one-to-one.

Matthew Orgass
darkstar@pgh.net