Subject: Re: Interrupts as threads
To: Bill Studenmund <wrstuden@netbsd.org>
From: Andrew Doran <ad@netbsd.org>
List: tech-kern
Date: 01/12/2007 14:38:54
Hi Bill,

On Tue, Dec 19, 2006 at 04:05:24PM -0800, Bill Studenmund wrote:

> On Sun, Dec 03, 2006 at 10:36:30PM +0000, Andrew Doran wrote:

> > The problem is that at the moment, acquiring any given spin lock is a
> > deadlock waiting to happen unless you are at ~IPL_SCHED or above when
> > acquiring it, or are certain that it will only ever be acquired with the
> > kernel lock unheld. Ensuring that the kernel lock is always held when you
> > acquire the spin lock means that the spin lock is useless :-).
> 
> I've been thinking about this, and I think you are not correct. Well. In
> the long-term, you are. But I think as a transition step, we may have to
> accept it.

I'm not sure I follow. We did discuss this briefly and I think that you were
proposing a scheme where we have the individual interrupt handlers acquire
the kernel lock, although I might have completely misunderstood. :-)

> All we have to do is define a correct locking hierarchy. It's ok to acquire
> a given spin lock if you have the kernel lock. It's ok to acquire said
> spinlock w/o the kernel lock. It's just NOT ok to acquire the kernel lock
> while holding said lock. Yes, this could make interrupt handling routines
> painful (when they hand things off to the main kernel), but we can do it.

The problem as I see it is: getting it right all at once would be a big
effort, and I don't see that we have the resources to do it that way.

I said that I wanted to use interrupts as threads, and process context locks
(the Solaris style mutexes) for MP safety. The two basic problems I want to
solve by doing that are:

o The one mentioned above - if the locks are difficult to use (potential
  deadlocks against the kernel_lock) then that's a problem.  For places
  where we want MP safety (not MT safety) the mutexes can be thought of as
  spin locks, but with the additional property that they can block in order
  to avoid deadlock. That might be against the current CPU already holding
  the mutex (preemption), or against the CPU that already holds the mutex
  wanting the big lock. Essentially, they remove the ordering constraint
  against the big lock (see the sketch after this list).

o If we keep the distinction between interrupt and process context across
  the board, then we're likely to increase the number of locks in the
  kernel: one set for interrupt access, and one for process access. I've
  spent a lot of time trying to make process and LWP state MP safe for
  signalling. Even though it works well enough, I'm not particularly happy
  with the end result, because there's a mix of locks where we only need
  ~half the number. It's not good to use spin locks with a raised SPL in a
  lot of places, because we might need to block briefly on a process context
  lock, or in the overall picture we end up holding interrupts off for much
  longer.

I did some simple profiling of the SPL operations and here is example output
from a run that I did. It's from a single CPU machine serving up files at
line rate using 8 ftpds over 100Mbps Ethernet, and doing some disk I/O
locally. I wasn't looking for any specific kind of behaviour other than
'doing I/O'. The machine is set up so that there is a 1:1 mapping between
each symbolic interrupt level and each hardware IRQ line.

    1    |         2         |         3         |         4         |         5
level    |    intrs   persec |  blkself   persec | blkother   persec |  splraise   persec
---------+-------------------+-------------------+-------------------+--------------------
softnet  |   274094     6334 |    47549     1098 |        0        0 |    572849    13239
bio      |    17939      414 |      252        5 |        0        0 |   1415581    32717
net      |   373121     8623 |    29857      690 |     1760       40 |    631179    14588
tty      |        0        0 |        0        0 |        3        0 |       428        9
vm       |        0        0 |        0        0 |    11646      269 |   5211031   120438
audio    |        0        0 |        0        0 |        0        0 |         0        0
clock    |     4328      100 |      353        8 |    33913      783 |   1156023    26718
high     |        0        0 |        0        0 |      112        2 |    291410     6735
ipi      |        0        0 |        0        0 |        8        0 |       435       10

The column groups are (each shows a raw count and a per-second rate):

1. Interrupt level.
2. Number of interrupts at that level.
3. Number of splfoo() operations that blocked an interrupt at level foo.
   This doesn't include the SPL adjustment that MD code does as part of
   taking the interrupt.
4. Number of splfoo() operations that blocked an interrupt below level
   foo. Note that this does not count blocked soft interrupts.
5. Number of splfoo() operations.
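
To illustrate, here's roughly how counters like these can be gathered. It's
not the code that produced the table (for one thing, it counts blocked
interrupts as they arrive rather than per spl call), and the names are
made up:

    struct spl_stat {
        uint64_t intrs;          /* column 2 */
        uint64_t blkself;        /* column 3 */
        uint64_t blkother;       /* column 4 */
        uint64_t splraise;       /* column 5 */
    } spl_stats[NIPL];

    int
    splraise(int level)
    {

        spl_stats[level].splraise++;     /* column 5 */
        return md_setspl(level);         /* MD switch, returns old level */
    }

    void
    intr_dispatch(int level)
    {
        int spl = curcpu_spl();

        if (spl >= level) {
            /* Blocked: attribute it to the current spl level. */
            if (spl == level)
                spl_stats[spl].blkself++;
            else
                spl_stats[spl].blkother++;
            mark_pending(level);         /* replay at splx() time */
            return;
        }
        spl_stats[level].intrs++;        /* column 2 */
        run_handlers(level);
    }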

Given the amount of contention that this _one_ test shows, it's clear that
we can't just use interrupts as threads and replace the SPL system with
locks, or we will really suffer from all the additional context switching.
At least, we can't do that without a major engineering effort to change how
work is passed from one level to the next.

I think that we can also cut the number of spl calls significantly. For one,
the number of splvm() calls indicated above seems a bit excessive.
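
To illustrate the kind of change I mean (uvm_process_page() is a made-up
helper), a loop that raises and lowers the SPL per item can often be
restructured to do so once per batch:

    int i, s;

    /* Before: 2 * npages SPL operations. */
    for (i = 0; i < npages; i++) {
        s = splvm();
        uvm_process_page(pgs[i]);
        splx(s);
    }

    /* After: 2 SPL operations for the whole batch. */
    s = splvm();
    for (i = 0; i < npages; i++)
        uvm_process_page(pgs[i]);
    splx(s);

The trade-off is that the level stays raised across the whole batch, so this
only wins where the per-item work is short.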

What I propose is using interrupts as threads for IPL_VM and below, but also
preserving the SPL system. We would certainly need to use it for networking.
In other areas (like block I/O) where there aren't so many chokepoints, I'm
inclined to believe that we can get away without it.

I don't think interrupts as threads have to be expensive: typically just a
stack switch and some additional accounting in the interrupt path. In
addition to the interrupt coming in, there would need to be defined
preemption points, like lock release or splx(). That's in contrast to (to
use an example that people have cited) FreeBSD, where the initial approach
was a lot more costly as I understand it: in addition to taking the
interrupt, a thread had to be picked and scheduled to run at a later time,
for example on return to user space. I don't know how their system works
these days.
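
To sketch what I mean by a preemption point (again, hypothetical names, and
hand-waving over the MD details):

    void
    splx(int newlevel)
    {
        struct cpu_info *ci = curcpu();

        md_setspl(newlevel);
        /*
         * An interrupt that arrived while we were at a higher level
         * was only marked pending - it didn't go near the run queues.
         * Lowering the SPL is a defined preemption point: switch
         * stacks directly to the pending interrupt thread, without
         * involving the scheduler.
         */
        if (ci->ci_ipending & ~spl_mask(newlevel))
            intr_thread_switch(ci);
    }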

Thoughts?

Cheers,
Andrew