tech-smp: smp status -- more details..

Subject: smp status -- more details..
To: None <tech-smp@netbsd.org, tech-kern@netbsd.org, tech-ports@netbsd.org>
From: Bill Sommerfeld <sommerfeld@orchard.arlington.ma.us>
List: tech-smp
Date: 08/19/2000 00:12:47
Here's an overview of where I am right now..  Note that significant
parts of the code described here have not yet committed to -current;
some of the details described below are likely to change.

Locks, locks and more locks:

First, an overview of the locking primitives available:

1) __cpu_simple_lock: primitive mutual exclusion.  This is defined in
<machine/lock.h>, and is the fundamental locking primitive each port
must define.  Only locking code and special-purpose routines (e.g.,
debugger, kernel printf) should use this type.  __cpu_simple_lock is a
simple "spin until you get it", with no debugging assistance
available; also, it is intended that __cpu_simple_lock should be
usable in user-space applications which need atomic operations.

2) simple_lock: mutual exclusion.  Locks are conceptually "owned" by
CPU's.  With options LOCKDEBUG, this expands to include debug tests
for a number of common locking errors: unlock of a lock we don't own,
lock of a lock we already hold, etc.., Without options LOCKDEBUG, the
simple_lock abstraction collapses to be equivalent to for
__cpu_simple_lock.  

With one exception (see below), none of these locks should be held at
context-switch time.  A "typical" simple_lock is the scheduler lock.

3) spinlock (struct lock with LK_SPIN): reader-writer spinlock, owned
by a CPU.  This is used when simple non-recursive mutual exclusion is
not sufficient.  There are two primary uses: the proclist_lock (used
as a reader-writer lock), and the kernel_lock (used as a recursive
lock).  As with simple_locks, none of these should be held at
context-switch time.

4) sleep lock (struct lock without LK_SPIN): reader-writer lock owned
by a process (thread of control); if it's not available, the caller
will sleep.  The most common example of these are vnode locks.

Locks and spl's:

Traditionally unix has used spl's to keep interrupt routines from
running in a critical section.  spl's are really a scheduling
primitive, not a locking primitive; to exclude an interrupt routine
from a critical section, you need to run at a priority as high as or
higher than the interrupt for the duration of the critical section.

This obviously breaks down in the multiprocessor case, since each cpu
will be operating at a different priority; however, because the bulk
of the existing code uses spl's, we need to make locks and spl's play
nice together..

Locks which may be taken from interrupt code must be handled very
carefully; you must spl to the highest IPL where the lock is needed
before acquiring the lock, so that, while holding the lock, you cannot
be interrupted by code which needs to acquire the lock.

It is also important to avoid attempting to acquire a lock at too high
a level, since certain (very high priority) interrupts are often
needed to keep the system as a whole from deadlocking, and must not be
blocked while you are spinning waiting for a lower-priority lock.

In addition, the lock-debugging hooks themselves need to use locks!

A raw __cpu_simple_lock may be used from interrupts as long as it
is acquired and held at a single IPL.

A simple_lock (which is a __cpu_simple_lock wrapped with some
debugging hooks) may be used at or below spllock(), which is typically
at or above splhigh()

lockmgr spinlocks may be used at or below splsched(), which is
typically equal to traditional splhigh()

Some ports may have interrupts of higher priority than splsched(),
including hard serial interrupts, inter-processor interrupts, and
kernel debugger IPI's.
 
Synchronization:

The traditional UNIX synchronization model uses wakeup() and sleep();
[BSD4.4 added tsleep(), which takes an additional timeout parameter.]

In the example pseudo-code:

	while (queue is empty)
		sleep(&queue);

there is a small window between when the queue is checked and when the
proces goes to sleep.  In the traditional UNIX uniprocessor model, the
kernel is scheduled cooperatively, so the above code is safe -- no
other process could get in sideways before the slep.

However, with the addition of multiple processors, this assumption
can no longer be maintained.  We have extended the UNIX tsleep() call
into ltsleep(), which maintains a lock until after the process has
been "put to sleep", so that the window can be closed.

	simple_lock(&queue_lock);
	while (queue is empty)
		ltsleep(&queue, PQUEUE, 0, 0, &queue_lock);
	...
	simple_unlock(&queue_lock);

Normally, ltsleep will relock the lock after the process has
reawakened; however, it will not do that if you "or" the sleep
priority with PNORELOCK.

The kernel lock:

In order to avoid having to rewrite all existing code to add locks and
replace sleep/tsleep with ltsleep, we've added a "kernel_lock", which
must be held when non-MP-safe code needs to run.  

For now, that's pretty much the whole kernel..

The kernel is always entered through machine-dependant code.  There
are three basic ways in:
	- system calls
	- traps/exceptions
	- interrupts 

With the "big lock" model, the kernel lock must be acquired on entry
to the kernel.   This can happen in several ways:

	- it can be acquired on behalf of a process (e.g, for system
call or page fault); this involves acquiring the lock and setting the
P_BIGLOCK flag in p->p_flag.

	- it can be acquired to run an interrupt handler.
In the latter case, the kernel may already be running on this CPU, and
thus it has to be acquired with the LK_CANRECURSE flag (indicating
that recursive locking is OK).

	- in the case of a page fault in kernel mode (for instance,
when running copyin() or copyout() between kernel space and a
non-resident user page), the lock needs to be acquired recursively.

At tsleep/ltsleep time, if the process which is about to go to sleep
holds the kernel lock (i.e., p->p_flag & P_BIGLOCK is set), it is
released; it is then reacquired after the process wakes up.

The scheduler lock:

Currently, the scheduler uses a single set of run queues across all
CPU's.  Access to them is controlled via a simple_lock known as
sched_lock.  <sys/sched.h> includes SCHED_LOCK() (which brings the
system to splsched() and acquires the scheduler lock) and SCHED_UNLOCK
(which undoes that work).

It's worth noting that the restructuring involved in adding the
SCHED_LOCK calls allowed for the removal of some redundant
splhigh()/splx() pairs, which may (slightly) improve performance even
in the uniprocessor case.
 
Lock ordering:

To avoid deadlocks, any pair of locks which may be held simultaneously
must always be acquired in the same order.

The general order right now is:

	kernel lock first
	(miscellaneous other locks)
	sched_lock 
	LOCKDEBUG and kernel printf-related locks.

Don't be surprised if some enhanced LOCKDEBUG code winds up enforcing
this ordering more rigorously.