Subject: Re: Cloning bdev/cdev devices, step one
To: Bill Studenmund <wrstuden@zembu.com>
From: Chuck Silvers <chuq@chuq.com>
List: tech-kern
Date: 07/09/2000 10:10:05
On Fri, Jul 07, 2000 at 01:33:26PM -0700, Bill Studenmund wrote:
> On Thu, 6 Jul 2000, Chuck Silvers wrote:
> 
> > this sounds like a fine thing to me.  making devices less tied to
> > dev_t internally in the kernel is a good thing, since dev_t is really
> > just a kludgy way of representing a device in a filesystem.
> 
> Uhm, how is it kludgy? Hasn't it been that all unix has needed to
> represent/specify a device is the dev_t? I mean, hasn't it been
> that that's been the canonical specifier? :-) If dev_t is the canonical
> specifier, how is using dev_t in the kernel a kludge? :-)

it's not dev_t in the kernel that I'm complaining about, it's dev_t in
general as a way to uniquely identify a device.  I mean, originally dev_t
was a 16-bit value, and expanding it to 32 bits was a source of much
discontent when that happened in netbsd a while back.  your example of
changing the number of partitions in a scsi device is another case where
the limitations of dev_t cause strife.  some devices also chop up the
minor number in funny ways, like tapes have no-rewind bits, etc.
the whole thing just seems very clunky.
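
as a rough sketch of the kind of bit-chopping I mean: the layout below
(12-bit major, 20-bit minor), the ex_ names, the no-rewind bit position and
the unit*nparts+part disk encoding are all assumptions made up for
illustration, not the actual netbsd encoding.

```c
#include <stdint.h>

/* hypothetical 32-bit device number: 12-bit major, 20-bit minor.
 * this layout is an assumption for illustration only. */
typedef uint32_t ex_dev_t;

#define EX_MINOR_BITS 20
#define EX_MINOR_MASK ((1u << EX_MINOR_BITS) - 1)

/* pack a major/minor pair into one device number */
static inline ex_dev_t ex_makedev(uint32_t maj, uint32_t min)
{
	return (maj << EX_MINOR_BITS) | (min & EX_MINOR_MASK);
}

static inline uint32_t ex_major(ex_dev_t d) { return d >> EX_MINOR_BITS; }
static inline uint32_t ex_minor(ex_dev_t d) { return d & EX_MINOR_MASK; }

/* hypothetical tape minor: low 2 bits select the unit, bit 2 means
 * "don't rewind on close". */
#define EX_TAPE_NOREWIND   0x4u
#define EX_TAPE_UNIT(min)  ((min) & 0x3u)

/* hypothetical disk minor with a fixed partition count per disk.
 * the same (unit, part) pair maps to a different minor as soon as
 * nparts changes, which is where the strife comes from. */
static inline uint32_t ex_disk_minor(uint32_t unit, uint32_t part,
    uint32_t nparts)
{
	return unit * nparts + part;
}
```

with this encoding, partition a of disk 1 is minor 8 when there are 8
partitions per disk and minor 16 when there are 16, so existing device
nodes stop matching after the change.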

however, since dev_t is part of various standard unix APIs, we're stuck
with it in interfaces.  but internally we can try to minimize the impact.


> > I wonder though, is there any value in providing the glue to be able
> > to refer to the cookie as a vnode field?  v_rdev looks to be there
> > more for backward compatibility than anything else, but this is something
> 
> It's been there since the initial import from Berkeley, so we'd need their
> source control logs to see the real history. But I think it means
> "real" device, as opposed to the dev_t on disk.
> 
> 99.9% of the time, the two are the same. But if we go to 64 partitions per
> disk (using new major numbers), then they won't necessarily be the
> same. I expect that what we will do is map the 8/16 partition devices to
> the equivalent 64 partition device. So for things like sd0a, the inode
> will still have the 8/16 partition major, while the vnode's v_rdev will
> have the 64-partition major number. That way the right thing happens if
> there's an sd0a on disk with the 64-partition major number.

so how would you make a device node on disk to represent the 34th partition?

in stat(), st_dev vs. st_rdev distinguishes the device that a device node
lives in vs. the device that the device node refers to.  but that has
nothing to do with vnode fields.  I'm only talking about the in-kernel
data structures here, not what's visible to applications.
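
for reference, the application-visible distinction looks something like
this. it's only a sketch: ex_node_devs() is a name invented for the
example, and it uses nothing beyond stat(2) and the S_ISCHR/S_ISBLK macros.

```c
#include <sys/types.h>
#include <sys/stat.h>

/*
 * look up path and report both device numbers.
 * *holder gets st_dev, the device that the node itself lives in.
 * *target gets st_rdev, the device that the node refers to, and is
 * only filled in when path is a character or block device node.
 * returns 1 for a device node, 0 for anything else, -1 on error.
 */
static int ex_node_devs(const char *path, dev_t *holder, dev_t *target)
{
	struct stat st;

	if (stat(path, &st) == -1)
		return -1;
	*holder = st.st_dev;
	if (S_ISCHR(st.st_mode) || S_ISBLK(st.st_mode)) {
		*target = st.st_rdev;
		return 1;
	}
	return 0;
}
```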

btw, I would hope that we will get away from the current notion of a fixed
number of partitions per disk at some point and move to a more volume-oriented
design. I imagine raidframe does something reasonable here but I haven't
looked at it yet.  clearly some environments (eg. embedded stuff) won't
want the overhead of a volume-management system, but others (eg. ISPs)
would gladly pay the price for the increased flexibility.  I think what
I mean here is that our default install should create filesystems in
volumes (or whatever raidframe calls them) rather than directly in
device partitions, or at least have an option to do this.
but this is a separate discussion.


> Also, I suspect that the /dev/console vnode used to have the real console
> device's dev_t shoved into its vnode. Nowadays, our console driver just
> hangs onto a vnode with the right dev_t in it. Note: this is all done so
> that sys_revoke() works right.
> 
> > new so there is no previous name to be compatible with.  the same would
> > have held for the other v_ aliases defined along with v_rdev, but I guess
> > whoever was doing that was trying to be consistent.  my take on that
> > is that it's confusing the namespaces of vnodes vs. devices and it would be
> > better to not pretend those are vnode fields, but that's pretty subjective.
> 
> I've dug into all of this fairly deeply, to get layered device nodes to
> work. And it does make sense. :-) Think of them as vnode fields which are
> only valid if you have a device node. Anything in the vfs systems which
> sees a character or block vnode knows that these fields are there. They
> are also vnode fields in that they are a public interface to the node (as
> opposed to the fs-specific private stuff).
> 
> :-)
> 
> Note: if we do put the cookie in the vnode, I think it should go in struct
> specinfo, and get a v_devcookie define too. Mainly because it helps memory
> scaling. We only need these fields for devices. With them in struct
> specinfo, we only allocate space for them for each seen device. If we put
> it in struct vnode, then we allocate that space always. I haven't done
> counts, but I expect most systems to have a LOT more vnodes than device
> vnodes. :-)

yes, I understood that the new field would really live in struct specinfo.
my point is just that since the device info is available as
vp->v_specinfo->si_foo, what's so great about referring to it as vp->v_foo?
everything that would use vp->v_foo already has to make sure that it's
operating on a device vnode, so why clutter the namespace?
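
to make the comparison concrete, here is a cut-down sketch of the two
spellings. the ex_ structures are simplified stand-ins rather than the real
struct vnode and struct specinfo, and si_devcookie is a hypothetical name
for the proposed cookie field; v_devcookie is the define suggested above.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t ex_dev_t;
enum ex_vtype { EX_VREG, EX_VCHR, EX_VBLK };

/* device-only state, allocated only for device vnodes */
struct ex_specinfo {
	ex_dev_t  si_rdev;
	void     *si_devcookie;	/* hypothetical name for the new cookie */
};

struct ex_vnode {
	enum ex_vtype       v_type;
	struct ex_specinfo *v_specinfo;	/* NULL unless a device vnode */
};

/* the alias style: make specinfo members look like vnode fields */
#define v_rdev       v_specinfo->si_rdev
#define v_devcookie  v_specinfo->si_devcookie

static int ex_is_device(const struct ex_vnode *vp)
{
	return vp->v_type == EX_VCHR || vp->v_type == EX_VBLK;
}

/* with the alias: reads like a vnode field, still needs the type check */
static void *ex_cookie_alias(const struct ex_vnode *vp)
{
	return ex_is_device(vp) ? vp->v_devcookie : NULL;
}

/* without the alias: same check, and the indirection is visible */
static void *ex_cookie_explicit(const struct ex_vnode *vp)
{
	return ex_is_device(vp) ? vp->v_specinfo->si_devcookie : NULL;
}
```

both functions compile to the same thing; the only difference is whether
the reader can see that the field belongs to the device and not the vnode.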

-Chuck


> > perhaps it would be useful to sketch out an example of how this scheme would
> > work so that the details would be clearer.  I know that matt thomas was
> > advocating a different scheme (and perhaps other people have other ideas),
> > and if we had examples of how each of them would accomplish their goals
> > we'd have a better notion of their benefits.
> 
> I'm awaiting Jason's sketch too. I really like the idea of being able to
> add ccd's on the fly (the part shown so far). I'm a bit worried about
> swapping out vnodes in upper layers (since we'd have to either special
> case certain device major numbers, or we'd have to be passing struct
> vnode ** into VOP_IOCTL() so the device could do it), but it might work
> well.
> 
> Take care,
> 
> Bill