Subject: Re: 32 bit dev_t, Revision 2
To: Todd Vierling <tv@NetBSD.ORG>
From: Chris G. Demetriou <cgd@pa.dec.com>
List: tech-kern
Date: 01/11/1998 16:52:07
[ this was written as quickly as possible, so there may be thinkos or
typos which appear to present an opinion that I do not hold.  that's
what y'all get for typing back and forth to each other a lot while i
was too busy to chime in. 8-]

> I have been told that one simplification of the transition method is just to
> update MAKEDEV and mknod and re-run MAKEDEV from a miniroot setup.  While
> this is easy for some people, it is not simple for everyone.  The proposal
> below includes two-way device number compatibility and allows for the
> ability to netboot NetBSD off of a NFS server with only 16 bit device
> support. 

Uh, last I saw, NFS mandated use of opaque 32-bit tokens for device
numbers.  Isn't that true?  (I seem to recall that, but can't quote
C&V, and don't want to pull the RFC down.)  That kills at least half of
your argument.


> 1. dev_t--only when _KERNEL is defined--becomes an opaque type of the
> definition:
>     typedef union { u_int32_t i; } dev_t;

Please, do this while debugging, but don't commit it that way.  While
Perry's claims about what i've said are incorrect, the spirit of them
is correct.


> 3. Our new dev_t will be split 12 bits major, 20 bits minor.  If the top 12
> bits are zeroes, the dev_t is an "old" device when considering conversion in
> the kernel.

This is fine by me.  If people want more major bits, that'd be fine by
me, too, but most of the systems I use use 12/20.

I like 20; i can easily imagine using 20 minor bits.

8 bits to select a SCSI bus, 4 to select a target, 3 to select a LUN,
5 to select a partition.  8-)
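
A minimal sketch of that layout, assuming a plain 32-bit integer
representation.  The names (dev_make, dev_major, dev_minor, scsi_minor)
are made up for illustration; only the 12/20 split and the 8/4/3/5
field widths come from the discussion above.

```c
#include <stdint.h>

/* Hypothetical names; only the 12/20 and 8/4/3/5 widths come from the text. */
#define DEV_MINOR_BITS	20
#define DEV_MINOR_MASK	((UINT32_C(1) << DEV_MINOR_BITS) - 1)

/* Pack a 12-bit major and a 20-bit minor into one 32-bit value. */
static inline uint32_t
dev_make(uint32_t maj, uint32_t min)
{
	return ((maj & 0xfff) << DEV_MINOR_BITS) | (min & DEV_MINOR_MASK);
}

static inline uint32_t dev_major(uint32_t d) { return d >> DEV_MINOR_BITS; }
static inline uint32_t dev_minor(uint32_t d) { return d & DEV_MINOR_MASK; }

/* 20-bit minor laid out as bus(8) | target(4) | lun(3) | partition(5). */
static inline uint32_t
scsi_minor(uint32_t bus, uint32_t target, uint32_t lun, uint32_t part)
{
	return ((bus & 0xff) << 12) | ((target & 0xf) << 8) |
	    ((lun & 0x7) << 5) | (part & 0x1f);
}
```

With every field at its maximum, scsi_minor() fills exactly the 20
minor bits, so nothing spills into the major.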


> 4. The major device numbers will be renumbered into three blocks.  Major
> number 0 will not be used; it is reserved as a flag for "old" dev_t's.
> These three blocks will have separate bdevsw/cdevsw structures (planned to
> be merged into a devsw structure if the API for the device calls is
> rethought to include character and block distinctions in the calls).
> 
> %0xxxxxxx xxxx:  If the top bit of the major number is 0 (major 0 through
> 4095), the device is a dynamically allocated device (planned for future
> expansion in a dynamic device system and/or LKMs). 
> 
> %10xxxxxx xxxx:  If the top bit is 1 and the next bit is 0 (major 4096
> through 6143), the device is a statically numbered machine-independent
> device (anything in src/sys/dev et al.).  MI devices are kept consistent
> across all ports.
> 
> %11xxxxxx xxxx:  If the top two bits are 1 (major 6144 through 8191), the
> device is a statically numbered machine-architecture-dependent device.  MD
> devices are kept consistent across all ports of the same ${MACHINE_ARCH}.
> MD devices which are ported to become MI will receive MI major numbers, but
> their MD numbers will not be decommissioned.

First of all, this is only 4k devs, not 8k devs.

Second, it makes lookup harder for no good reason.  If we decide to
partition the device space later in this way, we can do it with few
problems.  It only adds complexity now.
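
To make those two points concrete, here is a rough sketch of what the
quoted prefix scheme works out to when the major really is 12 bits
wide.  It is illustration only, not a suggestion to implement it; the
enum and function names are hypothetical.

```c
#include <stdint.h>

#define MAJOR_BITS	12
#define MAJOR_COUNT	(UINT32_C(1) << MAJOR_BITS)	/* 4096, not 8192 */

/* Hypothetical labels for the three blocks in the quoted proposal. */
enum major_class { MAJ_DYNAMIC, MAJ_STATIC_MI, MAJ_STATIC_MD };

/*
 * Classify a 12-bit major by its top two bits.  Every lookup would
 * have to go through a step like this before indexing a table.
 */
static inline enum major_class
major_classify(uint32_t maj)
{
	if ((maj >> (MAJOR_BITS - 1)) == 0)
		return MAJ_DYNAMIC;		/* 0xxx...: 0 through 2047 */
	if (((maj >> (MAJOR_BITS - 2)) & 1) == 0)
		return MAJ_STATIC_MI;		/* 10xx...: 2048 through 3071 */
	return MAJ_STATIC_MD;			/* 11xx...: 3072 through 4095 */
}
```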

There are better ways to solve the problem of having "Lots" of LKM
devices, but they don't need to be implemented now, either.

Third, well, there's no point in doing the MI vs. MD device thing.
"See my other message."  8-)


> 5. Character and block device major numbers for a given device must match.
> If a character device or a block device does not have a corresponding
> counterpart, the counterpart will be unconfigured.

No, not "character and block device major numbers must match."  There
should be a unified table, with character and block devices in the same
table and a flag to tell you which types each entry supports.
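
Roughly, such a table could look like the sketch below.  All names and
the example entries are hypothetical, and a real entry would also carry
the open/close/read/write/strategy function pointers, which are left
out here to keep it short.

```c
#include <stddef.h>

/* Hypothetical flag bits: which device types an entry supports. */
#define DEVSW_CHAR	0x1
#define DEVSW_BLOCK	0x2

struct devsw {
	const char	*name;	/* driver name, for illustration only */
	int		flags;	/* DEVSW_CHAR and/or DEVSW_BLOCK */
};

/* One table, indexed by major number, shared by both device types. */
static const struct devsw devsw_table[] = {
	{ NULL,  0 },				/* major 0: unused */
	{ "sd",  DEVSW_CHAR | DEVSW_BLOCK },	/* disk: both types */
	{ "tty", DEVSW_CHAR },			/* character only */
};
#define NDEVSW	(sizeof(devsw_table) / sizeof(devsw_table[0]))

/* Nonzero if the given major exists and supports the given type. */
static int
devsw_supports(unsigned int maj, int type)
{
	return maj < NDEVSW && (devsw_table[maj].flags & type) != 0;
}
```

A device with no block counterpart simply lacks the DEVSW_BLOCK bit,
so there is no second table that has to be kept in step.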


> 6. When COMPAT_[09-13] is defined in the kernel, the macro major() will
> include inlined support for an old-to-new major number conversion table (one
> for block, and one for character).  Both the major() and minor() macros will
> retrieve only the proper set of bits from the dev_t depending on the top 12
> bits.

In the kernel, if the compat macros are defined, the following should
happen:

'major' should return the correct major, by either:
	(1) returning the right bits, if it's a new device node, or
	(2) doing a conversion of the right bits if it's an old device
	    node.

'minor' should return the correct minor, by either:
	(1) returning the right bits, if it's a new device node, or
	(2) doing a conversion of the right bits based on the major
	    number, if it's an old device node.

There should be cdev and bdev conversion tables, which should look
something like:

struct devsw_conv {
	dev_t	new_major;
	dev_t	(*cvt_minor)(dev_t); /* not really dev_t's, but known to fit */
};
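
A sketch of how major and minor lookups might use such a table for
character devices.  Assumptions, none of which are settled above: old
device numbers are 8 bits major / 8 bits minor, "old" means the top 12
bits are zero, an unmapped old major comes back as 0, and the table
contents are invented.  uint32_t stands in for dev_t so the sketch
compiles on its own.

```c
#include <stddef.h>
#include <stdint.h>

struct devsw_conv {
	uint32_t	new_major;
	uint32_t	(*cvt_minor)(uint32_t);
};

static uint32_t cvt_identity(uint32_t m) { return m; }

/* Invented example: old minor = unit << 3 | part, new = unit << 5 | part. */
static uint32_t cvt_disk(uint32_t m) { return ((m >> 3) << 5) | (m & 0x7); }

/* Indexed by old major.  Entries are made up for illustration. */
static const struct devsw_conv cdev_conv[] = {
	{ 0,  NULL },		/* old major 0: no mapping */
	{ 17, cvt_identity },	/* old major 1 -> new major 17 */
	{ 42, cvt_disk },	/* old major 2 -> new major 42 */
};
#define NCDEV_CONV	(sizeof(cdev_conv) / sizeof(cdev_conv[0]))

#define DEV_MINOR_BITS	20
#define DEV_MINOR_MASK	((UINT32_C(1) << DEV_MINOR_BITS) - 1)
#define DEV_IS_OLD(d)	(((d) >> DEV_MINOR_BITS) == 0)

static uint32_t
cdev_major(uint32_t d)
{
	uint32_t om;

	if (!DEV_IS_OLD(d))
		return d >> DEV_MINOR_BITS;	/* new node: just the bits */
	om = (d >> 8) & 0xff;			/* assumed old 8/8 layout */
	return om < NCDEV_CONV ? cdev_conv[om].new_major : 0;
}

static uint32_t
cdev_minor(uint32_t d)
{
	uint32_t om;

	if (!DEV_IS_OLD(d))
		return d & DEV_MINOR_MASK;
	om = (d >> 8) & 0xff;
	if (om >= NCDEV_CONV || cdev_conv[om].cvt_minor == NULL)
		return d & 0xff;		/* no converter: pass through */
	return cdev_conv[om].cvt_minor(d & 0xff);
}
```

With these invented entries, the old node 0x020b and the new node
(42 << 20) | 35 yield the same major and minor even though the raw
values differ.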


file systems should compare devices e.g. for mounting purposes by
comparing major numbers and minor numbers.


> 7. The stat interface will be bumped a version number again, introducing
> __stat14(), __fstat14(), and __lstat14().  These will return a file's dev_t
> unchanged, or if COMPAT_[09-13] are defined, dev_t's always converted to new
> format using the old-to-new conversion table above.  mknod(2) will not be
> changed, and will always create device nodes with the numbers unchanged.

stat should always return a dev_t as a 32-bit opaque data item.

No translation of the number should happen in stat, at all.

It'd be ... a royal PITA to have to worry about whether or not the
number being returned is 'real'.

Same with mknod.  no conversions, either direction.  32-bit opaque
data item.

Same in core.  No conversion of dev_t's, _EVER_.

major() and minor() might be tricky: they may say a dev_t has a
different major than what would appear from naive inspection, but
that's no real problem.


I've suffered enough at the hands of systems which thought they were
smarter than I am.


> 8. The old stat interfaces, if included by a COMPAT_[09-13] option, will do
> direct searches of the old-to-new table above to demote new dev_t's to 16 bit
> dev_t's.  This can cause no-matches, which should be listed as major number
> 255.  Programs _needing_ use of the major and minor numbers of a dev_t
> should conceivably be recompiled, but this gives _some_ useful values in
> the case where compatibility is required, such as finding a process's tty
> device based on device number.  Compat routines for other OS's may also
> require this inverse mapping, or may use a "truncated" major device number.

*punt*  If other OSes only have 16-bit dev_t's, then their compat code
can implement some hack.  Otherwise, leave it alone.

What uses of programs do you think this will cause problems for?


> 9. In the kernel, any direct equality comparisons of dev_t's will be changed
> to use a new macro, isdevequal(), which does the logic of:
>     ((major(x) == major(y)) && (minor(x) == minor(y)))
> when any of COMPAT_[09-13] are defined.  Without a compat option, it will
> collapse to a binary compare.  This compare will include the old-to-new
> remapping automatically.

yes.  Very good.  I think i might prefer the name "devcmp()" though
(after timercmp()).
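
For illustration, a devcmp() along those lines might be sketched as
below.  Like timercmp() it is nonzero when the comparison holds; that
convention is an assumption here.  The compat_major()/compat_minor()
helpers, the old 8/8 layout and the mapping table are hypothetical
stand-ins for whatever major() and minor() end up doing.

```c
#include <stdint.h>

#define MINOR_BITS	20
#define MINOR_MASK	((UINT32_C(1) << MINOR_BITS) - 1)

/* Hypothetical old-to-new major map, indexed by old 8-bit major. */
static const uint32_t old_to_new_major[] = { 0, 17, 42 };
#define N_OLD_MAJOR \
	(sizeof(old_to_new_major) / sizeof(old_to_new_major[0]))

/* Major with old-to-new remapping; top 12 bits zero means "old". */
static uint32_t
compat_major(uint32_t d)
{
	uint32_t om;

	if ((d >> MINOR_BITS) != 0)
		return d >> MINOR_BITS;
	om = (d >> 8) & 0xff;			/* assumed old 8/8 layout */
	return om < N_OLD_MAJOR ? old_to_new_major[om] : 0;
}

/* Minor; old minors are passed through unchanged in this sketch. */
static uint32_t
compat_minor(uint32_t d)
{
	return (d >> MINOR_BITS) != 0 ? (d & MINOR_MASK) : (d & 0xff);
}

/* Nonzero if x and y name the same device.  Evaluates each argument twice. */
#define devcmp(x, y) \
	(compat_major(x) == compat_major(y) && \
	 compat_minor(x) == compat_minor(y))
```

The point is that an old-format node and its new-format equivalent
compare equal here even though their raw 32-bit values do not.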


> 10. The definition of NODEV will change to
>     #define NODEV (u_int32_t)(-1)
> and can only be compared to a dev_t after passing the dev_t through
> devtoraw().

Sure.  It'd be better if we had a set of types to specify "32 bits or
larger" rather than mandating that dev_t be exactly 32 bits, though.
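
As a present-day illustration of an "at least 32 bits" type: C99
(which postdates this message) provides uint_least32_t in <stdint.h>.
The names devnum_t and DEVNUM_NONE below are hypothetical stand-ins for
dev_t and NODEV.

```c
#include <limits.h>
#include <stdint.h>

/* At least 32 bits wide; may be wider on some platforms. */
typedef uint_least32_t devnum_t;	/* hypothetical stand-in for dev_t */

/* All-ones in whatever width devnum_t actually has. */
#define DEVNUM_NONE	((devnum_t)-1)

static inline int
devnum_is_none(devnum_t d)
{
	return d == DEVNUM_NONE;
}
```

Casting -1 to the type, rather than to a fixed 32-bit type, keeps the
sentinel correct if the type turns out to be wider than 32 bits.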


> 11. In the kernel, any need to use the dev_t value as a seed value (for hash
> tables and the like) will extract it using the macro devtoint().  This will
> provide a u_int32_t value equal to makedev(major(x),minor(x))--inlining
> conversions from old dev_t's as necessary.  This is _not_ a cop-out function
> and is only allowed in this particular context (hash values).

*punt* after above comments.


> 12. All kernel use of dev_t as an integer must comply with this API wrt
> isdevequal(), devtoint(), devtoraw(), and rawtodev().  Direct access to its
> data is disallowed, and use of devtoraw() and rawtodev() (converting a dev_t
> to/from a raw u_int32_t) is restricted only to conditions listed in (13)
> below. 

With the exception of the comparison function, *punt*.


> 13. The only exceptions to the dev-as-integer rule are
>  - shared filesystem servers that need untranslated dev_t values
>  - testing of dev_t raw values against special cookies (VNOVAL, NODEV, etc.)
> Whether this will require a special set of stat() calls to return only raw
> dev_t's is as yet undefined.

"see above."


> 14. mknod(8) will be introduced to a new command line option to create "old" 
> style device nodes.  Possibly, mknod(8) will be modified to have an option
> to specify explicitly the number of bits used in each of the major or minor
> device numbers.

No to the former.  Yes to the latter.  Best of all, give it the option
to take an opaque 32-bit number and mknod() that.  _THAT_ would be most
useful.
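
A userland sketch of that last idea, under the assumption that the
number arrives as a string in decimal, octal or hex.  The helper names
are hypothetical and no particular mknod(8) option syntax is implied;
the only system call used is mknod(2).

```c
#include <ctype.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>

/*
 * Parse an opaque device number (decimal, 0-prefixed octal, or
 * 0x-prefixed hex).  Returns 0 and stores the value on success,
 * -1 if the string is malformed or does not fit in 32 bits.
 */
static int
parse_raw_dev(const char *s, uint32_t *out)
{
	char *end;
	unsigned long v;

	if (!isdigit((unsigned char)*s))	/* no sign, no whitespace */
		return -1;
	errno = 0;
	v = strtoul(s, &end, 0);
	if (errno != 0 || *end != '\0' || v > UINT32_MAX)
		return -1;
	*out = (uint32_t)v;
	return 0;
}

/* Create a node whose device number is exactly the value given. */
static int
mknod_raw(const char *path, mode_t type_and_perms, const char *raw)
{
	uint32_t v;

	if (parse_raw_dev(raw, &v) != 0) {
		errno = EINVAL;
		return -1;
	}
	/* No major/minor splitting: the number is passed through as-is. */
	return mknod(path, type_and_perms, (dev_t)v);
}
```

For example, mknod_raw("/dev/foo", S_IFCHR | 0600, "0x02a00005") would
ask the kernel for a character node with that exact number, given
sufficient privilege.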



> 15. The old-to-new remapping may be tunable via a sysctl, if applications or
> filesystem servers need access to raw dev_t's in the standard set of
> __stat14() syscalls, even with COMPAT_[09-13] in the kernel.  This is as yet
> undefined. 

What does this _mean_?  I get the feeling that *punt* applies to it,
as well, though.  8-)


chris