kernel linker wish

To: tech-kern%netbsd.org@localhost
Subject: kernel linker wish
From: David Young <dyoung%pobox.com@localhost>
Date: Wed, 2 Jan 2013 13:18:12 -0600

I was reminded while reading the 'options LUA' discussion about a
feature that I wished our kernel had.  I'll sketch it out here and hope
that it's interesting enough to someone who has enough spare time that
they can go and program it. :-)

It would be Nice(TM) if the kernel linker understood weak/strong aliases
in the kernel and in kernel modules so that using weak aliases one could
provide a stub implementation for an optional subsystem in the kernel,
and using strong aliases a loadable module could provide a full-fledged
implementation.  Taking bpf(4) as an example, the kernel could provide
weak aliases to stub routines for, e.g., bpf_attach(), bpf_mtap(), etc:

__weak_alias(bpf_attach, voidop);
__weak_alias(bpf_mtap, voidop);
.
.
.

and the BPF kernel module could provide strong aliases to the actual
implementation:

__strong_alias(bpf_attach, bpf_attach_impl);
__strong_alias(bpf_mtap, bpf_mtap_impl);
.
.
.

There are a couple of reasons, I think, to prefer this to the scheme
that bpf(4) uses now to provide a stub implementation and a modular
"real" implementation.  One, using the aliases scheme lets the kernel
patch in direct calls, so we could avoid indirect calls through the
bpf_ops vector.  Also, it's not necessary to create an operations
vector like bpf_ops for every module that we want to provide stub/real
implementations for.
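
The stub half of this can be tried in userland. The following is a
minimal sketch, assuming GCC or Clang on an ELF target; opt_attach and
opt_mtap are hypothetical stand-ins with simplified signatures (not the
real bpf(4) prototypes), and the compiler attribute is used here in
place of the kernel's __weak_alias() macro.

```c
/*
 * Userland sketch only; assumes GCC or Clang on an ELF target.
 * opt_attach/opt_mtap are hypothetical names with simplified
 * signatures, standing in for optional-subsystem entry points.
 */

/* Stub shared by every optional entry point: do nothing, return 0. */
int
voidop(void)
{
	return 0;
}

/*
 * Each entry point is a weak alias for voidop.  A strong definition
 * of the same name in another object file would take precedence at
 * link time; with none present, calls land in voidop.
 */
int opt_attach(void) __attribute__((weak, alias("voidop")));
int opt_mtap(void) __attribute__((weak, alias("voidop")));
```

With no strong definition linked in, calling opt_attach() or opt_mtap()
runs voidop() and returns 0.  What the static linker does here once, at
link time, is what the proposal asks the kernel linker to do at module
load time.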

A rough idea for how to implement this in the kernel linker
is this: when the kernel linker finds a strong alias
bpf_attach -> bpf_attach_impl in a kernel module that overrides
an existing weak alias bpf_attach -> voidop, it can "push" the
old alias onto a stack corresponding to the symbol 'bpf_attach',
push(aliases['bpf_attach'], voidop).  When it unloads the kernel module,
it re-assigns the alias: bpf_attach -> pop(aliases['bpf_attach']).
Let's say for now that the height of this stack is just 1 or 0.
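
The push/pop bookkeeping above can be modelled in a few lines.  This is
a sketch under stated assumptions, not kernel linker code: the table,
alias_lookup(), alias_push() and alias_pop() are hypothetical names, the
function signatures are simplified placeholders, and a function pointer
stands in for the call sites that a real linker would patch directly.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef int (*op_fn)(void);

/* One aliased symbol with a stack of height 0 or 1 (hypothetical). */
struct alias {
	const char *name;	/* aliased symbol, e.g. "bpf_attach" */
	op_fn target;		/* what the symbol resolves to now */
	op_fn saved;		/* the overridden weak target, if any */
	bool has_saved;		/* stack height: false = 0, true = 1 */
};

int voidop(void) { return 0; }		/* kernel's stub */
int attach_impl(void) { return 1; }	/* module's implementation */

struct alias aliases[] = {
	{ "bpf_attach", voidop, NULL, false },
};

struct alias *
alias_lookup(const char *name)
{
	for (size_t i = 0; i < sizeof(aliases) / sizeof(aliases[0]); i++)
		if (strcmp(aliases[i].name, name) == 0)
			return &aliases[i];
	return NULL;
}

/* Module load: a strong alias overrides the weak one. */
int
alias_push(const char *name, op_fn strong)
{
	struct alias *a = alias_lookup(name);

	if (a == NULL || a->has_saved)
		return -1;	/* unknown symbol, or stack already full */
	a->saved = a->target;
	a->has_saved = true;
	a->target = strong;
	return 0;
}

/* Module unload: re-assign the alias to the saved weak target. */
int
alias_pop(const char *name)
{
	struct alias *a = alias_lookup(name);

	if (a == NULL || !a->has_saved)
		return -1;	/* nothing to restore */
	a->target = a->saved;
	a->has_saved = false;
	return 0;
}
```

After alias_push("bpf_attach", attach_impl) the entry resolves to
attach_impl; after alias_pop("bpf_attach") it resolves to voidop again.
A second push without an intervening pop is refused, which is the
"height is just 1 or 0" restriction.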

Of course, you don't want to unload a kernel module while the kernel
is in it.  That is, you don't want for the text of, say, bpf_attach()
to go away while the kernel cv_wait()s inside it.  I believe you
can handle that using the entrance/exit-counting scheme for softc's
that I've described earlier (Subject: kicking everybody out of the
softc) in conjunction with a new modcmd, MODULE_CMD_CATCH, that
tells a module to change its behavior while the kernel unloads it.
Roughly, unloading a kernel module would go something like this:

1 Prepare the module to catch new threads as they try to enter the
  module, modcmd(MODULE_CMD_CATCH).  Preparation may entail creating
  a mutex/condvar pair.  Threads that subsequently enter the
  module may have to acquire the mutex on the way in, signal the
  condvar and release the mutex on the way out.

2 Re-link the kernel's stub implementations (e.g.,
  bpf_attach -> voidop).  In this way, no more threads may enter the
  module, so we can hope for the next step to finish.

3 Module-specific cleanup, modcmd(MODULE_CMD_FINI), may acquire the
  mutex installed in step 1, and wait for every thread to quit the
  module---i.e., entrance count equals exit count---using the condvar
  installed in step 1.

  Sometimes this step may fail.  Putting things back the way they
  were should be possible, but it could be tricky.

4 Finish unlinking the module.  Reclaim the module's text/data memory.
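
The counting and waiting in steps 1 and 3 might look roughly like the
following userland model.  It is only a sketch: it uses POSIX threads
rather than the kernel's mutex(9)/condvar(9), every name in it
(mod_gate, gate_enter, and so on) is hypothetical, and for simplicity
it takes the mutex on every entry and exit instead of only after the
proposed MODULE_CMD_CATCH.  Step 2 (re-linking the stubs) is assumed to
happen between gate_catch() and gate_drain().

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical per-module gate; all names are illustrative. */
struct mod_gate {
	pthread_mutex_t lock;
	pthread_cond_t drained;	/* signalled on exit while catching */
	bool catching;		/* set by step 1 */
	unsigned long entered;	/* entrance count */
	unsigned long exited;	/* exit count */
};

void
gate_init(struct mod_gate *g)
{
	pthread_mutex_init(&g->lock, NULL);
	pthread_cond_init(&g->drained, NULL);
	g->catching = false;
	g->entered = 0;
	g->exited = 0;
}

/* Called at the top of every module entry point. */
void
gate_enter(struct mod_gate *g)
{
	pthread_mutex_lock(&g->lock);
	g->entered++;
	pthread_mutex_unlock(&g->lock);
}

/* Called on the way out of every module entry point. */
void
gate_exit(struct mod_gate *g)
{
	pthread_mutex_lock(&g->lock);
	g->exited++;
	if (g->catching)
		pthread_cond_broadcast(&g->drained);
	pthread_mutex_unlock(&g->lock);
}

/* Step 1: start catching threads. */
void
gate_catch(struct mod_gate *g)
{
	pthread_mutex_lock(&g->lock);
	g->catching = true;
	pthread_mutex_unlock(&g->lock);
}

/* Step 3: wait until entrance count equals exit count. */
void
gate_drain(struct mod_gate *g)
{
	pthread_mutex_lock(&g->lock);
	while (g->entered != g->exited)
		pthread_cond_wait(&g->drained, &g->lock);
	pthread_mutex_unlock(&g->lock);
}
```

gate_drain() returns only once every thread that entered has also left.
It has no timeout, so it does not model the case where step 3 fails and
things have to be put back the way they were.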

Taking this a step further, suppose we want to layer one implementation
on another.  I.e., some module is stubbed out in the kernel.  We
load a module that provides implementation A.  Then we load another
module providing implementation B that refines implementation A.
Or vice versa: we load implementation B first and implementation A
second, and A refines B.  I've been contemplating this in the
context of bus_space(9): one module may provide some debug
instrumentation such as an mmap(2)-able ring buffer of bus_space(9)
access records looking sort of like [I/O read | I/O write |
memory read | ..., address, width, value].  A second module may
provide advanced I/O exception handling.  And a third module may
re-order or delay reads and writes between bus barriers in order
to simulate important corner-cases of bus operation.  Any module
may refine either the behavior of the previously-loaded modules or
the behavior of the default implementation.  For example, let's
consider modules that override bus_space_read_4().  Say the default
implementation is in _bus_space_read_4:

__weak_alias(bus_space_read_4, _bus_space_read_4)

The module with debug instrumentation, bus_space_debug.kmod, has
a weak alias, bus_space_read_4, for its implementation called
debug_bus_space_read_4,

__weak_alias(bus_space_read_4, debug_bus_space_read_4)

It also reserves a private symbol for calling the implementation that it
overrides.  Call that symbol super_bus_space_read_4.

The module with exception handling, bus_space_xh.kmod, has
xh_bus_space_read_4,

__weak_alias(bus_space_read_4, xh_bus_space_read_4)

and likewise reserves a private symbol for calling the implementation it
overrode, also called super_bus_space_read_4.

If we load the modules bus_space_debug.kmod and bus_space_xh.kmod in
that order, then a call to bus_space_read_4 gets the xh_bus_space_read_4
implementation, which does its work and calls (through its symbol
super_bus_space_read_4) debug_bus_space_read_4, which does its work and
calls (through its super_bus_space_read_4) the default implementation,
_bus_space_read_4.

I think that to implement loading/unloading modules that refine each
other in this way, you could also use the aliases[symbol] stacks, but
they would grow taller than 0 or 1 items.
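
A taller stack with per-module "super" symbols could be modelled like
this.  Again a sketch with assumptions: read4_fn, layer_load(),
layer_unload() and the one-letter call trace are all hypothetical, each
module's private super_bus_space_read_4 symbol is represented by a
function pointer, and modules are only unloaded in reverse load order.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t (*read4_fn)(uint32_t addr);

#define MAX_LAYERS 8

/* Records the order in which the layers run. */
char call_trace[32];

void
trace_note(char c)
{
	size_t n = strlen(call_trace);

	if (n + 1 < sizeof(call_trace)) {
		call_trace[n] = c;
		call_trace[n + 1] = '\0';
	}
}

/* Default implementation (placeholder behaviour: addr + 1). */
uint32_t default_read4(uint32_t addr) { trace_note('0'); return addr + 1; }

read4_fn read4_current = default_read4;	/* current alias target */
read4_fn alias_stack[MAX_LAYERS];	/* overridden targets */
size_t alias_depth;

/* Debug module: its work, then the implementation it overrode. */
read4_fn debug_super;
uint32_t debug_read4(uint32_t addr) { trace_note('d'); return debug_super(addr); }

/* Exception-handling module, structured the same way. */
read4_fn xh_super;
uint32_t xh_read4(uint32_t addr) { trace_note('x'); return xh_super(addr); }

/* Load: push the old target, bind the module's super, re-point. */
int
layer_load(read4_fn impl, read4_fn *super_slot)
{
	if (alias_depth == MAX_LAYERS)
		return -1;
	alias_stack[alias_depth++] = read4_current;
	*super_slot = read4_current;
	read4_current = impl;
	return 0;
}

/* Unload the most recently loaded layer: pop the old target. */
int
layer_unload(void)
{
	if (alias_depth == 0)
		return -1;
	read4_current = alias_stack[--alias_depth];
	return 0;
}

/* What callers of the aliased symbol would reach. */
uint32_t read4(uint32_t addr) { return read4_current(addr); }
```

Loading debug_read4 and then xh_read4 makes a call to read4() run the
xh layer, then the debug layer, then the default, as in the load order
described above.  Unloading a layer from the middle of the stack would
also need the super pointer of the layer above it re-bound, which this
sketch does not attempt.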

It is strange to use a weak alias to override a weak alias (why should a
loadable module's weak alias override the kernel's weak alias?); it may
be necessary to have a new kind of alias or else some meta-information
about each alias so that there is no ambiguity about what the kernel
linker should do.

Dave

-- 
David Young
dyoung%pobox.com@localhost    Urbana, IL    (217) 721-9981