Subject: Re: The Right Way to implement threads?
To: Matthew Orgass <darkstar@pgh.net>
From: Nathan J. Williams <nathanw@MIT.EDU>
List: tech-kern
Date: 06/09/2000 03:05:05
<darkstar@pgh.net> (Matthew Orgass) writes:

>   I looked at some Solaris docs, dug out the last discussion on this topic,
> reread (some of) the paper on scheduler activations, and thought some more
> about thread models.  I have come up with a method that makes every thread
> preemptable and kernel scheduled, uses a reasonable amount of kernel
> resources, and should be somewhat faster than scheduler
> activations. 

I've been doing a lot of reading on this topic (I'm implementing
scheduler activations in NetBSD for my Master of Engineering thesis),
and I have some comments.

> 
>   This is done by separating kernel threads that are currently associated
> with processes from those processes and associating them only when needed. 
> All user threads would be scheduled by the kernel (needing somewhere around
> 100 bytes each of per-thread kernel memory) and do the equivalent of the
> many-to-many model inside the kernel.

OK, so you have a system where there is some per-thread kernel entity
that the kernel can schedule. Sounds like a LWP system so far. But
"many-to-many" models are characterized by one level (typically the
user level thread library) controlling which of its entities run on
the kernel's schedulable entities. The details of this control
mechanism are what make the many-to-many models interesting.

Maybe you just mean that time is divided among processes, and then
subdivided among threads, rather than having all threads competing
against each other (what POSIX calls "process" vs. "system" contention
scopes).
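
For concreteness, POSIX exposes that choice through the thread
attributes interface. This is a minimal standalone sketch (not tied to
any particular thread library) that requests system contention scope
and falls back if the implementation doesn't support it:

#include <pthread.h>
#include <stdio.h>

static void *
worker(void *arg)
{
	return arg;	/* trivial thread body */
}

int
main(void)
{
	pthread_attr_t attr;
	pthread_t t;

	pthread_attr_init(&attr);
	/*
	 * PTHREAD_SCOPE_SYSTEM: compete with every thread on the system.
	 * PTHREAD_SCOPE_PROCESS: compete only within this process.
	 * Implementations need not support both.
	 */
	if (pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM) != 0)
		fprintf(stderr, "system scope unsupported, using default\n");

	pthread_create(&t, &attr, worker, NULL);
	pthread_join(t, NULL);
	pthread_attr_destroy(&attr);
	return 0;
}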

However, thread scheduling by the kernel is considered a lose for a
couple of reasons. First, a slightly academic point: it forces all
applications into one scheduling model. Sometimes different apps can
tweak the parameters, but they're fundamentally in the same regime.

Second and more importantly, switching from one thread to another now
requires a system call, and the associated expense of changing
protection domains from user space to the kernel and back. Two-level
scheduling systems go out of their way to avoid this and switch
context at the user level when possible (which is often); the original
scheduler activations paper and the performance analysis papers that
led up to it show pretty clearly that making system calls to change
thread context is much more expensive than necessary.
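
To see where the cost goes, compare what a user-level switch actually
has to do: save and restore registers, and nothing else. A rough sketch
using the portable ucontext(3) interface (most swapcontext()
implementations do make one sigprocmask() call to preserve the signal
mask; real thread libraries avoid even that with hand-rolled assembly):

#include <stdio.h>
#include <ucontext.h>

static ucontext_t main_ctx, thr_ctx;
static char thr_stack[64 * 1024];

static void
thread_func(void)
{
	printf("in user thread\n");
	/* switch back: registers and stack pointer only, no kernel trip */
	swapcontext(&thr_ctx, &main_ctx);
}

int
main(void)
{
	getcontext(&thr_ctx);
	thr_ctx.uc_stack.ss_sp = thr_stack;
	thr_ctx.uc_stack.ss_size = sizeof(thr_stack);
	thr_ctx.uc_link = &main_ctx;
	makecontext(&thr_ctx, thread_func, 0);

	swapcontext(&main_ctx, &thr_ctx);
	printf("back in main\n");
	return 0;
}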

>  Until a user thread makes a syscall, you don't really need a kernel
> thread.  So don't allocate one (from a pool) until it is needed.

Okay, what is a "kernel thread" that is different from the
kernel-scheduled entity mentioned previously? The kernel stack, I
suppose. But you don't just need that for system calls; you also need
it for handling other traps, such as page faults or FPE traps
(interrupts are currently handled on a process's kernel stack, but
don't have to be).

> Each process (see below for what exactly this means) would have a
> configurable number of guaranteed kernel threads available to do
> kernel work for it.  It could decide to allow overcommitting of
> kernel threads if there are enough extra available or require that
> all of its guaranteed kernel threads be reserved for its use only.

> If there was a shortage of kernel threads and a user thread needed a
> guaranteed thread while it was in use, the system call of an
> overcommitted process is interrupted and the kernel thread given
> back to its rightful owner.

Preempting the kernel work being done on behalf of the other thread?
This is a big, big, big deal. It requires reworking a large fraction
of the kernel in complicated ways to permit correct operation in the
face of kernel preemption. It's on par with reimplementing the whole
thing.  Requiring this makes your suggestion less feasible by orders
of magnitude.

Besides that, I'm not sure it makes sense. The main resource consumed
by the kernel thread is its kernel stack, which isn't something that
can be thrown away even if the thread is preempted.
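
To illustrate the sort of code that would break (a generic example, not
actual NetBSD source): today, kernel code can update shared structures
without locks because it knows it runs to completion. Preempt it
mid-update and a concurrent caller's work is silently lost:

struct entry {
	struct entry *next;
};

static struct entry *head;	/* shared; safe without a lock today,
				   because kernel code isn't preempted */

void
insert(struct entry *e)
{
	e->next = head;		/* preempted here, another thread may
				   insert its own entry... */
	head = e;		/* ...and this store then discards it */
}

Every such sequence in the kernel would have to be found and protected
before kernel preemption could be turned on.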

> If a user thread in need was not able to procure a kernel thread, it
> would get in line for the next one available or perhaps take various
> measures against other lower priority threads in its process.  Whenever
> reasonable (I'm not sure how often this is, if ever), a user thread
> waiting for a kernel resource but not actually needing a kernel thread
> while waiting would not be associated with a kernel thread (unless it was
> reserved.  It might also be guaranteed one when it needs it).  As with the
> Solaris model, a signal could be sent to a process when all of its kernel
> threads are in use, allowing it to increase the number (subject to
> applicable limit) if desired. 

This seems like it is the heart of how your scheduling works, but it's
very unclear. 

>   A process always has just one signal handler table, but may have
> multiple address spaces and threads.

A single address space has historically been the key resource that
joined things together as a process. What do you mean, then, when you
say that a process may have multiple address spaces? Why is it useful
to consider entities with different address spaces as one process?

How can there only be one signal handler table if there are separate
address spaces? The handler functions won't exist in the different
spaces.
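
To make that concrete: all the kernel keeps for a handler is a user
virtual address, which only means something in the address space that
registered it. A minimal sketch with the standard sigaction(2)
interface:

#include <signal.h>
#include <string.h>
#include <unistd.h>

static void
on_sigint(int sig)
{
	(void)sig;
}

int
main(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	/*
	 * sa_handler is just a pointer into THIS address space; the
	 * same numeric address is meaningless, or worse, in any other.
	 */
	sa.sa_handler = on_sigint;
	sigaction(SIGINT, &sa, NULL);
	pause();
	return 0;
}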

> Pthread signal behavior is used. 

The pthreads signal model is an atrocious compromise. It is useful
only because it is standard, and applications can expect it.
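
It's standard, but look at what portable programs are driven to:
because an asynchronous process signal may be delivered to any thread
that has it unblocked, the usual idiom is to block the signal in every
thread and dedicate one thread to sigwait(). A minimal sketch of that
pattern:

#include <pthread.h>
#include <signal.h>
#include <stdio.h>

static void *
sig_thread(void *arg)
{
	sigset_t *set = arg;
	int sig;

	for (;;) {
		/* pull signals off synchronously instead of handling them */
		if (sigwait(set, &sig) == 0)
			printf("got signal %d\n", sig);
	}
	return NULL;
}

int
main(void)
{
	sigset_t set;
	pthread_t t;

	sigemptyset(&set);
	sigaddset(&set, SIGHUP);
	/* block before creating threads; they inherit the mask */
	pthread_sigmask(SIG_BLOCK, &set, NULL);
	pthread_create(&t, NULL, sig_thread, &set);
	pthread_join(t, NULL);
	return 0;
}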

> The userland PID uniquely identifies this process, though it is not in any
> way related to thread IDs (TIDs) or address space IDs (VIDs).  That is, a
> process is something that you can control as a unit.  Only a process has a
> parent.  Processes are not directly scheduled, though they keep as much
> scheduling (and other) info as possible to lower the cost for each thread.

This is again standard LWP stuff.

>   Comments?

It would be worthwhile for you to give a comparison of this system
with other concurrency-providing interfaces (pure-userlevel threads,
one-to-one kernel threads, scheduler activations, asynchronous
threads, demand-allocated LWP's [Solaris]). In your first paragraph,
you say:

> I have come up with a method that makes every thread preemptable 

Pretty much every thread system makes the user-level threads
preemptable. You've said that you want the kernel side to be
preemptable as well but haven't even suggested how this very 
tricky task can be accomplished.

> and kernel scheduled,

This is not considered an advantage, for the reasons I gave above.

> uses a reasonable amount of kernel resources, and

This is a good goal but it's far from clear that you've achieved it.

> should be somewhat faster than scheduler activations. 

Why?

        - Nathan