Devices.

To: tech-kern%netbsd.org@localhost
Subject: Devices.
From: David Holland <dholland-tech%netbsd.org@localhost>
Date: Sat, 29 May 2021 20:26:52 +0000

There are a number of infelicities in the way we currently handle the
I/O plumbing for devices in the kernel. These include:
   - cloning devices exist but as currently implemented violate
     layering abstractions;
   - every file system needs to have cut-and-pasted vnode ops tables for
     device vnodes;
   - the split between block and character devices was never
     particularly well anchored in reality (e.g. tapes) and has aged
     poorly since as new classes of devices have appeared;
   - because we don't distinguish between different classes of devices,
     nearly all device ops travel through the system as ioctls;
   - because of the ensuing complexity, dispatching of ioctls is a
     mess; there are many cases where ioctl handlers match some ioctls
     specifically and pass anything else on, which introduces
     opportunities for various kinds of bugs;
   - adding a new device-level operation to cdevsw or bdevsw requires
     touching every driver, including those it is completely
     irrelevant to;
   - if you have multiple sets of device nodes on a system (e.g. in
     chroots) operations that affect the device nodes themselves can
     behave differently or strangely depending on which copy you touch
     (we have had multiple generations of hacks to mitigate this, and
     none have been completely satisfactory);
   - because we don't distinguish device classes in any modern sense
     of the term, they cannot be addressed or reasoned about in system
     config or things like kauth policies;
   - and probably other things I haven't thought of.

I've been mumbling on and off for a long time about various parts of
this problem, and I think it's time to propose a unified architecture
for a solution. Note that the changes required are nontrivial and this
is not going to happen all at once (or anytime soon); the goal of
blathering about it is to try to reach agreement on a place that we
want to get to eventually... and also, to smoke out any places where
the proposed architecture won't actually work or is inconsistent with
what happens on the ground.

There are four major interconnected sets of changes I have in mind to
address these problems.

(1) Create explicit device classes. This would be adding a layer of
indirection between struct cdevsw/bdevsw and drivers; so e.g. a mouse
driver would, instead of declaring a struct cdevsw, declare a struct
mouse_dev containing operations on mice, and the cdevsw entry would
point at this. For disks, which for historical reasons live in both
cdevsw and bdevsw, both entries would point at the same disk_dev.
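
To make (1) a bit more concrete, here is a rough userland sketch of the
extra layer of indirection. Everything in it is an assumption made for
illustration: the members of struct mouse_dev and struct disk_dev, the
d_class and d_classops fields, and the cut-down struct cdevsw/bdevsw
are hypothetical and are not the existing kernel definitions.

```c
#include <stddef.h>

/* Hypothetical class table: the operations that make sense on a mouse. */
struct mouse_event {
	int dx, dy;
	unsigned buttons;
};

struct mouse_dev {
	int (*md_open)(void *softc);
	int (*md_close)(void *softc);
	int (*md_read_event)(void *softc, struct mouse_event *ev);
};

/* Hypothetical class table for disks. */
struct disk_dev {
	int (*dd_open)(void *softc);
	int (*dd_close)(void *softc);
};

enum dev_class { DEVCLASS_MOUSE, DEVCLASS_DISK };

/*
 * Cut-down stand-ins for the switch entries (NOT the real structures):
 * a class tag plus a pointer to the class-specific table.
 */
struct cdevsw {
	enum dev_class d_class;
	const void *d_classops;
};

struct bdevsw {
	enum dev_class d_class;
	const void *d_classops;
};

/* A toy mouse driver declares a mouse_dev rather than generic ops. */
static int toymouse_open(void *softc) { (void)softc; return 0; }
static int toymouse_close(void *softc) { (void)softc; return 0; }

static int
toymouse_read_event(void *softc, struct mouse_event *ev)
{
	(void)softc;
	ev->dx = 1;		/* pretend the mouse moved one unit */
	ev->dy = 0;
	ev->buttons = 0;
	return 0;
}

static const struct mouse_dev toymouse_dev = {
	.md_open = toymouse_open,
	.md_close = toymouse_close,
	.md_read_event = toymouse_read_event,
};

const struct cdevsw toymouse_cdevsw = {
	.d_class = DEVCLASS_MOUSE,
	.d_classops = &toymouse_dev,
};

/* A toy disk driver: both switch entries share one disk_dev. */
static int toydisk_open(void *softc) { (void)softc; return 0; }
static int toydisk_close(void *softc) { (void)softc; return 0; }

static const struct disk_dev toydisk_dev = {
	.dd_open = toydisk_open,
	.dd_close = toydisk_close,
};

const struct cdevsw toydisk_cdevsw = { DEVCLASS_DISK, &toydisk_dev };
const struct bdevsw toydisk_bdevsw = { DEVCLASS_DISK, &toydisk_dev };
```

The class tag is one possible way to let generic code find out which
table it is looking at; a real design might well do this differently.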

(2) Abolish ioctl inside the kernel, or at least within the device
tree. Given separate device classes, each operation needed can be made
its own operation on that device class (that is, a function pointer in
struct foo_dev) with the ensuing large increase in clarity about what
the operations are and where they need to be implemented. Plus this
way we get type safety for the arguments.
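
As an illustration of the difference (2) is after, the following
sketch puts an ioctl-style entry point next to the same two operations
written as members of a class table. The command numbers, the softc
layout and the dd_* member names are all made up for the example and
are not taken from any real driver.

```c
#include <errno.h>
#include <stdbool.h>

struct toydisk_softc {
	unsigned sc_secsize;
	bool sc_dirty;
	int sc_flushes;
};

/* Old style: one untyped entry point. Command numbers are invented. */
#define TOY_GETSECSIZE	1UL
#define TOY_CACHESYNC	2UL

int
toydisk_ioctl(struct toydisk_softc *sc, unsigned long cmd, void *data)
{
	switch (cmd) {
	case TOY_GETSECSIZE:
		/* Nothing checks that the caller passed an unsigned *. */
		*(unsigned *)data = sc->sc_secsize;
		return 0;
	case TOY_CACHESYNC:
		/* Nothing checks that the caller passed an int *. */
		if (*(int *)data || sc->sc_dirty) {
			sc->sc_flushes++;
			sc->sc_dirty = false;
		}
		return 0;
	default:
		/* Unknown commands are only detected at run time. */
		return ENOTTY;
	}
}

/* New style: each operation is its own typed member of the class. */
struct disk_dev {
	int (*dd_get_sector_size)(void *softc, unsigned *sizep);
	int (*dd_cache_sync)(void *softc, bool force);
};

static int
toydisk_get_sector_size(void *softc, unsigned *sizep)
{
	struct toydisk_softc *sc = softc;

	*sizep = sc->sc_secsize;
	return 0;
}

static int
toydisk_cache_sync(void *softc, bool force)
{
	struct toydisk_softc *sc = softc;

	if (force || sc->sc_dirty) {
		sc->sc_flushes++;
		sc->sc_dirty = false;
	}
	return 0;
}

const struct disk_dev toydisk_dev = {
	.dd_get_sector_size = toydisk_get_sector_size,
	.dd_cache_sync = toydisk_cache_sync,
};
```

In the second form the compiler checks the argument types, and an
operation a driver doesn't implement shows up as a missing member
rather than as a default case somewhere in a switch.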

(3) Rearrange the way operations dispatch to devices. The traditional
model is that opening a device gives you a device vnode, and device
ops are dispatched to the device driver by looking up the major number
and indirecting through either the cdevsw or bdevsw table, and passing
the minor number as an argument. (This throws away all information
about how or when the device was opened, which is why cloners needed
to do something different.) The proposed method is that device vnodes
resolve the identity of the device when first loaded, and at runtime
point to the e.g. struct disk_dev rather than remembering the major
number; and for cloners they point to a device instance structure
created when the device is opened, which holds the per-instance data
the cloner needs. (It's not clear to me right now if only cloners
should get instance structures, or if for uniformity it makes sense to
allocate them for every driver, or if it should be a property of
certain device classes -- the overhead of allocating a handful of
extra small structures isn't important, so it's mostly a code
complexity question.) Operations on devices go to the device vnode and
are then sent on to the driver directly; the cdevsw and bdevsw tables
are used only when devices are first looked up. All ioctls are turned
into explicit device-level operations in the device vnode's ioctl op.
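
A minimal userland sketch of the dispatch model in (3), under heavy
simplifying assumptions: struct dev_ops stands in for a class table
such as struct disk_dev, toy_devsw stands in for cdevsw/bdevsw, and
every name here is hypothetical. It also assumes at most one open per
device vnode, which the real thing obviously could not.

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Per-open state. In this sketch only the cloner allocates one. */
struct dev_instance {
	int di_serial;
};

/* Stand-in for a class table such as struct disk_dev. */
struct dev_ops {
	int (*do_open)(unsigned minor, struct dev_instance **instp);
	int (*do_read)(unsigned minor, struct dev_instance *inst, int *out);
	void (*do_close)(unsigned minor, struct dev_instance *inst);
};

/* The device vnode remembers the resolved driver, not the major number. */
struct dev_vnode {
	const struct dev_ops *dv_ops;
	unsigned dv_minor;
	struct dev_instance *dv_inst;	/* NULL until a cloner is opened */
};

/* Toy cloning driver: every open gets its own instance. */
static int toyclone_next_serial;

static int
toyclone_open(unsigned minor, struct dev_instance **instp)
{
	struct dev_instance *inst = malloc(sizeof(*inst));

	(void)minor;
	if (inst == NULL)
		return ENOMEM;
	inst->di_serial = ++toyclone_next_serial;
	*instp = inst;
	return 0;
}

static int
toyclone_read(unsigned minor, struct dev_instance *inst, int *out)
{
	(void)minor;
	*out = inst->di_serial;
	return 0;
}

static void
toyclone_close(unsigned minor, struct dev_instance *inst)
{
	(void)minor;
	free(inst);
}

static const struct dev_ops toyclone_ops = {
	toyclone_open, toyclone_read, toyclone_close,
};

/* Toy switch table indexed by major; used only by devvnode_load(). */
static const struct dev_ops *const toy_devsw[] = { &toyclone_ops };
#define TOY_NMAJOR (sizeof(toy_devsw) / sizeof(toy_devsw[0]))
int toy_devsw_lookups;	/* counts table lookups, for demonstration */

/* Resolve the device identity once, when the vnode is first loaded. */
int
devvnode_load(struct dev_vnode *dv, unsigned major, unsigned minor)
{
	if (major >= TOY_NMAJOR)
		return ENXIO;
	toy_devsw_lookups++;
	dv->dv_ops = toy_devsw[major];
	dv->dv_minor = minor;
	dv->dv_inst = NULL;
	return 0;
}

/* Later operations go straight to the driver through the vnode. */
int
devvnode_open(struct dev_vnode *dv)
{
	return dv->dv_ops->do_open(dv->dv_minor, &dv->dv_inst);
}

int
devvnode_read(struct dev_vnode *dv, int *out)
{
	return dv->dv_ops->do_read(dv->dv_minor, dv->dv_inst, out);
}

void
devvnode_close(struct dev_vnode *dv)
{
	dv->dv_ops->do_close(dv->dv_minor, dv->dv_inst);
	dv->dv_inst = NULL;
}
```

Two vnodes loaded for the same major and opened separately end up with
distinct instances, and the switch table is touched once per load
rather than once per operation.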

(3a) Further down the line it might make sense to make devices _not_
vnodes but instead make them different instances of struct file
(either one for all devices or even one for each device class) -- this
would move ioctl dispatching up a layer, which would be an
improvement. But I don't think this needs to be part of the initial
plan.

(4) Make device vnodes fs-independent and rearrange how looking them
up works. Get rid of the extra ops table for devices that every fs has
to have (and also the one for fifos); instead, make fs-level special
file vnodes mostly-inert objects that don't support anything much
besides getattr and setattr. Then, when namei produces a special file
vnode, look up the driver that it references, produce a device vnode
for that driver, and return that, with the FS's special file vnode
hanging off to the side so it can be used for stat and chown/chmod
operations. Note that in this model the device vnode is itself a
mostly-stateless wrapper; it points at a device instance and at a
special file vnode and dispatches operations to them, but doesn't do
much of anything itself. This makes multiple special files for the
same driver work as expected: all driver state is shared between all
opens, but each open is associated with a specific special file and
e.g. chmod on it won't affect others.
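
Here is a small sketch of the split described in (4), again with
entirely hypothetical names (struct specfile, struct driver_state,
devvnode_resolve and friends). The device vnode only holds two
pointers: attribute operations go to the FS's special file, device
operations go to the driver.

```c
#include <errno.h>
#include <stddef.h>

/* fs-level special file: mostly inert, just attributes and a device id. */
struct specfile {
	unsigned sf_mode;
	unsigned sf_major;
	unsigned sf_minor;
};

/* State owned by the driver and shared by every open. */
struct driver_state {
	int ds_opens;
};

/* fs-independent device vnode: a thin, mostly stateless wrapper. */
struct dev_vnode {
	struct driver_state *dv_drv;	/* where device ops go */
	struct specfile *dv_spec;	/* where stat/chmod/chown go */
};

/* Toy driver table indexed by major number. */
static struct driver_state toy_driver;
static struct driver_state *const toy_drivers[] = { &toy_driver };
#define TOY_NMAJOR (sizeof(toy_drivers) / sizeof(toy_drivers[0]))

/* Roughly what namei would do when it lands on a special file. */
int
devvnode_resolve(struct specfile *sf, struct dev_vnode *dv)
{
	if (sf->sf_major >= TOY_NMAJOR)
		return ENXIO;
	dv->dv_drv = toy_drivers[sf->sf_major];
	dv->dv_spec = sf;
	return 0;
}

/* Attribute operations are forwarded to the special file. */
unsigned
devvnode_getmode(const struct dev_vnode *dv)
{
	return dv->dv_spec->sf_mode;
}

void
devvnode_setmode(struct dev_vnode *dv, unsigned mode)
{
	dv->dv_spec->sf_mode = mode;
}

/* Device operations are forwarded to the driver. */
int
devvnode_open(struct dev_vnode *dv)
{
	dv->dv_drv->ds_opens++;
	return 0;
}
```

With two special files for the same major (say one in a chroot), both
opens are counted in the one driver_state, while changing the mode
through one device vnode leaves the other special file alone.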

Note that this set of changes also enables something else I've been
talking about occasionally: storing driver names rather than major
numbers in device special file inodes. In this world driver lookup
happens only at namei time, rather than on every operation, so it's no
longer necessary for it to be especially fast and it becomes ok to do
it by string search. However, this is a separate matter and isn't part
of the plan (and may not even be a good idea) so let's not bikeshed it
just yet.

One question is: what device classes does it make sense to
materialize? ISTM that anything readily identifiable that there's more
than one of is a reasonable candidate (disks, ttys, audio,
framebuffers, mice, scanners, etc.) but I think the underlying
criterion should be something slightly different. This paper:
https://www.usenix.org/conference/osdi-04/recovering-device-drivers
observed that manifesting device classes lets you write recovery logic
such that if you need to shoot a driver, reset the hw, and restart it
you can then restore the driver to the state the rest of the system
expects it to be in. We ought to have an implementation of that :-) so
I think that should be the basis for thinking about device classes.

Another question is: how do minor numbers work in this world? I
suspect that for most drivers the path of least resistance is to
remember the minor number in the device vnode and pass it to the
device ops. But it's also reasonable to create a device instance for
each valid minor number, look that up at open time, and then dispatch
via that instance afterwards. It may depend on the device class... it
isn't clear to me right now whether cloners exist in all classes
(meaning that all device classes will need machinery for handling
explicit instances) or are specific to some classes and not others.
Right now because cloners are messy they're probably not used in all
the places they might potentially make sense.
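
The two options for minor numbers might look roughly like this for a
toy driver with two units. The names (toy_unit, toy_read_by_minor,
toy_open_instance, toy_read_instance) are invented for the sketch; the
only point is where the minor number gets validated and used.

```c
#include <errno.h>
#include <stddef.h>

#define TOY_NUNITS 2

struct toy_unit {
	int u_value;
};

static struct toy_unit toy_units[TOY_NUNITS] = { { 10 }, { 20 } };

/*
 * Option A: the device vnode remembers the minor number and passes it
 * to every operation, so every operation has to validate it.
 */
int
toy_read_by_minor(unsigned minor, int *out)
{
	if (minor >= TOY_NUNITS)
		return ENXIO;
	*out = toy_units[minor].u_value;
	return 0;
}

/*
 * Option B: the minor number is turned into an instance once, at open
 * time, and later operations take the instance directly.
 */
int
toy_open_instance(unsigned minor, struct toy_unit **unitp)
{
	if (minor >= TOY_NUNITS)
		return ENXIO;
	*unitp = &toy_units[minor];
	return 0;
}

int
toy_read_instance(const struct toy_unit *unit, int *out)
{
	*out = unit->u_value;
	return 0;
}
```

Option A is closer to what drivers do today; option B moves the
bounds check to open and looks the same as what a cloner would need
anyway.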

A third question: how does this affect interfaces? The answer is:
hopefully as little as possible. Interfaces are their own mess :-|

Anyhow, I think this architecture addresses all the problems cited.
The critical question is: what have I overlooked? There are probably
some issues I've thought about but failed to remember to discuss
above; there are also probably some issues I've not thought about or
am completely unaware of.

If you are aware of any details anywhere that would explode all this
please post.

Also, if it seems unclear or vague on some particular point, please
post too; reactions of that form sometimes just mean I didn't write
clearly and should try again, but sometimes also reflect real problems
or issues that have been overlooked. The ways in which the different
sets of changes interact aren't necessarily obvious and might be wrong
in places.

And if you think it's all a terrible idea or that the problems at the
beginning are nonissues, that's important to know too.

Hopefully though we can reach some kind of conclusion about the
direction to aim in. (How to get there without exploding the world on
the way is then the next question...)

Note that what's in this message is a summary of things I've been
contemplating the past few years, and probably a fair number of people
have heard parts of it before, but I think this is the first time I've
tried to really roll it all together.

-- 
David A. Holland
dholland%netbsd.org@localhost