Subject: Re: partition bookkeeping
To: Greywolf <greywolf@starwolf.com>
From: Bill Studenmund <wrstuden@nas.nasa.gov>
List: tech-kern
Date: 09/22/1999 19:23:58
On Wed, 22 Sep 1999, Greywolf wrote:
> On Wed, 22 Sep 1999, der Mouse wrote:
> 
> # This is one of the arguments in favor of devfs (at least, those who
> # like c0t0s0 names might think so): it makes this possible.  There's no
> # reason you couldn't have minors allocated only for those devices that
> # actually exist; as long as /dev/dsk/dks0d2s5 gets partition 5 on device
> # 2 on controller 0, nobody cares whether its minor number is anything in
> # particular.
> 
> A devfs is completely independent of the naming convention.  Names are
> for the humanoid types.
> 
> If we move all the disklabel out of the kernel, it is worth asking
> the question of whether or not the minor number will ever be used,
> since in the device driver, the minor number will show the disk number
> (say in all the bits above the fifth) and the partition to reference
> on that disk (on bits 0-4 [22 partitions overflow four bits]).
> If the kernel has no disklabel/offset information, how does the
> driver get the bus/controller/unit/offset information?
I think the general idea was that the daemon (or routines in the kernel) 
would have to fill in the partition info in the kernel before the
partitions could be used. I don't think the idea was to have the daemon
get called for each access to an uninitialized partition. :-)
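
To make that concrete, here is a rough sketch of the minor-number split
described above (partition in bits 0-4, disk number in the bits above)
together with a table that has to be filled in before an open can
succeed. Every name in it (PART_BITS, part_configure, part_open, the
fixed NUNITS) is made up for illustration; it is not the existing driver
code and not the wedge proposal itself.

```c
#include <errno.h>
#include <stdint.h>

#define PART_BITS 5u                    /* bits 0-4 select the partition */
#define NPARTS    (1u << PART_BITS)     /* 32 partitions per disk */
#define PART_MASK (NPARTS - 1u)
#define NUNITS    4u                    /* assumed number of disks, illustration only */

/* Split a minor number into disk unit and partition index. */
unsigned minor_unit(unsigned minor) { return minor >> PART_BITS; }
unsigned minor_part(unsigned minor) { return minor & PART_MASK; }

/* Build a minor number from a unit and a partition index. */
unsigned make_minor(unsigned unit, unsigned part)
{
    return (unit << PART_BITS) | (part & PART_MASK);
}

/* In-core record for one partition; zero-initialised means "not configured". */
struct part_slot {
    uint64_t offset;      /* first sector of the partition on the disk */
    uint64_t length;      /* size in sectors */
    int      configured;  /* nonzero once a daemon or kernel routine filled it in */
};

static struct part_slot slots[NUNITS][NPARTS];

/* Called ahead of time by whoever reads the label. Returns 0 or an errno value. */
int part_configure(unsigned unit, unsigned part, uint64_t offset, uint64_t length)
{
    if (unit >= NUNITS || part >= NPARTS || length == 0)
        return EINVAL;
    slots[unit][part].offset = offset;
    slots[unit][part].length = length;
    slots[unit][part].configured = 1;
    return 0;
}

/* Open never calls out to userland; an unconfigured partition simply fails. */
int part_open(unsigned minor)
{
    unsigned unit = minor_unit(minor);
    unsigned part = minor_part(minor);

    if (unit >= NUNITS || !slots[unit][part].configured)
        return ENXIO;
    return 0;
}
```

With this layout, something like rsd2d (unit 2, partition 'd' = 3) would
be minor 67, and part_open(67) returns ENXIO until part_configure(2, 3,
...) has been called. The point is only that the lookup at open time
stays inside the kernel.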
> Let me guess, using fsck as an example:
> 
> 	fsck gets "/dev/disk/rsd2d" as its device
> 	fsck calls open("/dev/disk/rsd2d")		CS #1
> 	Eventually the vnode gets resolved, and
> 		the open routine for the vnode goes
> 		out to userland to talk to this daemon
> 		which handles all the device mappings	CS #2
> 	The daemon runs and returns the information
> 		to the vopen (or whatever)		CS #3
> 	The filehandle finally gets returned to
> 		fsck.					CS #4
> 
> Congratulations.  We're now doing twice as many context switches
> as we really need to.  Granted, how often will we be needing to
> do this (not too often, I'd surmise); nonetheless, will this be
> a trend to move things out of the kernel and into userland?
> Microkernels don't work -- context switching is not cheap.
> 
> And what, pray tell, do we do if this daemon happens to get
> corrupted on the disk?  We cannot fsck beyond the root
> filesystem in single-user mode.  (Never mind that there is
> probably more wrong if the daemon is zorched.)  We now
> have to restore from tape (we do have tape, right?  Oh, my.
> No, we don't because we're doing dynamic device allocation which
> needs that daemon.  Or is this just for disks?) just to
> run fsck on our filesystems.
That's a good reason to not do dynamic device allocation. :-)
But I don't think that was a real cornerstone of the wedges idea. I guess
I'll admit that I figured that the in-kernel loader would really need to
do more than just load the root partition, but not much. i.e. I figured
that the port-default in-kernel wedge initializer would do whatever the
disklabel reading code did now. To do otherwise would really just present
folks with too huge a shock, I think. :-) So we wouldn't go quite as
far up the poppycock creek as above. :-)
> I don't see that we really consume all that much space by
> mapping the partition information into kernel space:
> It's, what, a dev_t, a length and an offset plus some flags?
I agree. :-)
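
For scale, a per-partition record along those lines might look like the
sketch below. The field widths, the flag name and part_translate() are
my assumptions for illustration, not what any driver has today.

```c
#include <stdint.h>
#include <sys/types.h>   /* dev_t */

#define PART_F_READONLY 0x1u   /* hypothetical flag, illustration only */

/* One in-core partition record: a dev_t, an offset, a length, some flags. */
struct part_info {
    dev_t    dev;      /* device the partition lives on */
    uint64_t offset;   /* first sector of the partition on that device */
    uint64_t length;   /* partition size in sectors */
    uint32_t flags;    /* e.g. PART_F_READONLY */
};

/*
 * Translate a partition-relative request into an absolute sector number.
 * Returns 0 on success, -1 if the request would run past the partition.
 */
int part_translate(const struct part_info *p, uint64_t blk, uint64_t nblks,
                   uint64_t *abs_blk)
{
    /* Written so that blk + nblks cannot overflow. */
    if (blk > p->length || nblks > p->length - blk)
        return -1;
    *abs_blk = p->offset + blk;
    return 0;
}
```

A record like this is a few dozen bytes on common platforms, so even a
full table per disk stays small, and the per-request work is a bounds
check and an add.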
> I'm obviously missing something here because on the surface, such
> a dynamic scheme actually does look cool (as long as we don't go
> to the graveyard that System V made!).  But how realistic is this?
I don't think you're missing anything. :-) I just think that the wedges
proposal wasn't about dynamic node generation, but about cleaning up how
the in-core partition -> offset,length info gets filled in. What we have
now was designed for BSD labels only, and has been extended for mbr
partitions, and then for non-*BSD/non-mbr labeling, without a redesign.
And it shows. :-) The wedge proposal was a way to clean it up, and support
the things folks want to do now (like recursive/extended MBR partitions).
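
As a rough illustration of what "cleaning up how it gets filled in"
could mean, the sketch below keeps one offset,length table and lets
several label parsers fill it, tried in order. All of the names
(wedge_list, parse_mbr, parse_whole_disk, wedges_discover) are invented
for this example and are not from the proposal. The MBR parser handles
only the four primary entries of a standard MBR sector (signature 0x55
0xAA at bytes 510-511, 16-byte entries from byte 446); it does not
follow extended partitions.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_WEDGES 16u

struct wedge { uint64_t offset, length; };            /* in sectors */
struct wedge_list { struct wedge w[MAX_WEDGES]; size_t count; };

/* Append one entry; returns 0 on success, -1 if full or empty-sized. */
static int wedge_add(struct wedge_list *wl, uint64_t offset, uint64_t length)
{
    if (wl->count >= MAX_WEDGES || length == 0)
        return -1;
    wl->w[wl->count].offset = offset;
    wl->w[wl->count].length = length;
    wl->count++;
    return 0;
}

/* Read a little-endian 32-bit value. */
static uint32_t le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Primary MBR entries only. Returns the number of wedges added. */
static int parse_mbr(const uint8_t *sec0, uint64_t disk_sectors,
                     struct wedge_list *wl)
{
    int found = 0;

    (void)disk_sectors;
    if (sec0[510] != 0x55 || sec0[511] != 0xAA)
        return 0;                        /* not an MBR, let the next parser try */
    for (int i = 0; i < 4; i++) {
        const uint8_t *e = sec0 + 446 + 16 * i;
        /* e[4] is the type byte; 0 means the slot is unused. */
        if (e[4] != 0 && wedge_add(wl, le32(e + 8), le32(e + 12)) == 0)
            found++;
    }
    return found;
}

/* Fallback: no recognised label, so expose the whole disk as one wedge. */
static int parse_whole_disk(const uint8_t *sec0, uint64_t disk_sectors,
                            struct wedge_list *wl)
{
    (void)sec0;
    return wedge_add(wl, 0, disk_sectors) == 0 ? 1 : 0;
}

typedef int (*label_parser)(const uint8_t *, uint64_t, struct wedge_list *);

/* A port would list its default parsers here, in order of preference. */
static const label_parser parsers[] = { parse_mbr, parse_whole_disk };

/* Try each parser on sector 0; the first one that finds anything wins. */
int wedges_discover(const uint8_t *sec0, uint64_t disk_sectors,
                    struct wedge_list *wl)
{
    wl->count = 0;
    for (size_t i = 0; i < sizeof(parsers) / sizeof(parsers[0]); i++) {
        int n = parsers[i](sec0, disk_sectors, wl);
        if (n > 0)
            return n;
    }
    return 0;
}
```

The driver only ever sees the resulting offset,length entries, so adding
another label format means adding another parser to the list rather than
extending a structure that was designed around BSD labels.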
Take care,
Bill