Subject: Re: devfs, was Re: ptyfs fully working now...
To: Robert Elz <kre@munnari.OZ.AU>
From: Bill Studenmund <wrstuden@netbsd.org>
List: tech-kern
Date: 11/27/2004 15:33:12

On Fri, Nov 26, 2004 at 07:10:43PM +0700, Robert Elz wrote:
>     Date:        Fri, 26 Nov 2004 04:14:56 -0600
>     From:        Eric Haszlakiewicz <erh@nimenees.com>
>     Message-ID:  <20041126101456.GA4268@diana.nimenees.com>
>
>   | 	Why not?  What's wrong with having the kernel write out timestamp
>   | data along with all the other info about a device node?
>
> atimes are updated frequently - none of the formats I've seen mentioned
> (and I am still > 1200 messages behind current on list mail, so
> I might still be missing things) are really amenable to that happening
> a lot.   Changing owners/permissions/... of a device probably happens
> once a month at most sites (if that) so it doesn't really matter how
> efficiently that kind of update happens to the permanent storage method.

I think the binary format can deal with this.

Also, while we may have atimes and mtimes update frequently, how often do
they really need to be written to disk? The on-disk info really only
matters for the next reboot, yes? I'd expect that a devfs flush-to-disk
once a minute, or even once every ten minutes, would be fine. If we want,
we could even rig things so that the "first" atime or mtime update after
boot gets flushed "quickly" and all the others are delayed. That'd be for
a scenario where we want the next boot to be able to tell there was an
access within "X" minutes of a crash.
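
A minimal sketch of that flush policy, written as user-space Python rather
than kernel code. The class name, method names, and both intervals are
assumptions made for illustration; nothing here is an existing devfs
interface. The first dirty timestamp after boot gets a short deadline, and
every later batch gets the long one.

```python
class FlushPolicy:
    """Decide when dirty timestamp data should be written to disk.

    Hypothetical sketch: names and intervals are illustrative assumptions.
    """

    def __init__(self, quick_delay=5.0, normal_interval=600.0):
        self.quick_delay = quick_delay          # delay for the first update after boot
        self.normal_interval = normal_interval  # delay for every later batch
        self.first_flush_done = False
        self.deadline = None                    # None means nothing is dirty

    def note_update(self, now):
        """Record an atime/mtime change; schedule a flush if none is pending."""
        if self.deadline is None:
            if self.first_flush_done:
                delay = self.normal_interval
            else:
                delay = self.quick_delay
            self.deadline = now + delay
        # Further updates before the deadline ride along with the same flush.

    def should_flush(self, now):
        """Return True once the pending deadline has passed."""
        return self.deadline is not None and now >= self.deadline

    def flushed(self):
        """Call after the data has actually been written out."""
        self.first_flush_done = True
        self.deadline = None
```

A caller would invoke note_update() on each timestamp change, poll
should_flush() from a periodic timer, and call flushed() after the write.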

I agree that some sort of daemon-rewriting-a-file approach would need to
be written carefully to be able to handle atime & mtime changes.

>   | A few rationales for devfs:
>   | *) Reduce admin work to add a new device
>
> I have no problem with the kernel scanning /dev as it starts, then
> creating entries for any device it finds that isn't there already.
> If you're going to have a dynamic devfs type thing, then you're
> obviously already going to have to have the kernel know suitable names
> to use (which it doesn't need now), and obviously the kernel is already
> able to create device inodes (it does now in response to a mknod() sys
> call) so nothing particularly new is needed here.

There is the complication that you have to do this scan after the root
file system has been mounted r/w.

>   | *) eliminate device major/minor allocation issues (esp. useful for third
>   | 	party modules that might want to create devices)
>
> Changing from major/minor to something else is pretty much independent of
> how the user file name (device file name) -> kernel mapping gets done,
> which is what devfs is all about really.   That's one level lower.   I don't
> care what is in a device inode to link it to the driver.

Yes & no. We have to be able to store in the node the info needed to be
able to reestablish the linking. Block and char dev nodes can only hold a
dev_t's worth of info. So to be able to map to something other than major
& minor #s, we have to use something other than on-disk dev nodes.

> I know that you mean for it to be dynamic, but if you're going to have
> any kind of preserved properties (over reboots) for devices, then you have
> to have some kind of stable naming scheme.   Whatever that is can simply

I agree some stable naming scheme (what I've referred to as locators in
other notes) is needed.

> be used in a device inode as far as I'm concerned (changing from major/minor
> is going to have all kinds of upgrade and portability issues - what to
> put in pax (or cpio) files for devices, handling old dumps, ... but that
> can all be overcome, somehow).

We haven't figured out all of the pax/cpio issues, but I expect we'll be
able to think up something. One update tool I think we will need is one
that takes a current /dev (or a chroot's /dev) and generates a devfs data
file (in whatever format is chosen). So for a restore, you can restore to a
non-root disk (say you boot install media then restore to the real disks),
then run this tool to rebuild your devfs file.
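
A rough sketch of the scanning half of such a tool, in user-space Python.
The function names and the record fields are assumptions for illustration;
the on-disk format is undecided, so the sketch stops at collecting one
record per device node and does not write a file.

```python
import os
import stat

def node_entry(name, st):
    """Turn one lstat() result into a record, or None if it is not a device node."""
    if stat.S_ISCHR(st.st_mode):
        kind = "char"
    elif stat.S_ISBLK(st.st_mode):
        kind = "block"
    else:
        return None
    return {
        "name": name,
        "type": kind,
        "major": os.major(st.st_rdev),
        "minor": os.minor(st.st_rdev),
        "uid": st.st_uid,
        "gid": st.st_gid,
        "mode": stat.S_IMODE(st.st_mode),   # permission bits only
    }

def scan_dev(root):
    """Walk a /dev tree (or a chroot's /dev) and collect device node records."""
    entries = []
    for dirpath, _dirs, files in os.walk(root):
        for fname in files:
            path = os.path.join(dirpath, fname)
            # lstat() so that symlinks are not followed to their targets.
            entry = node_entry(os.path.relpath(path, root), os.lstat(path))
            if entry is not None:
                entries.append(entry)
    return entries
```

Calling scan_dev("/mnt/dev") after a restore to a non-root disk would yield
the ownership and permission data to feed into whatever writer is chosen;
the path here is only an example.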

The question you're describing comes down to whether we have one file for
everything or one "file" per device node. I think one (binary) file for
everything is best. Binary so that the kernel doesn't have to spend much
effort reading or writing the file.

I admit part of the reason I like the binary file is that I worked on dmfs
(Data Migration File System) while at NAS at NASA/Ames. We used layered
file systems (a variant of overlayfs to be exact) to implement a tertiary
storage system. When "large inode"  support for ffs was shot down, we used
a binary file indexed by inode # to track stat info and other residency
information for files. Using this simple db, we were able to make the data
mover operations invisible to userland applications (other than the
potential delay due to tape motion issues for restore). So I've done this,
and it seemed to work well.
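
To make the "binary file indexed by a number" idea concrete, here is a
small user-space Python sketch. The record layout (uid, gid, mode, atime,
mtime) and the function names are assumptions for illustration, not the
dmfs format or a proposed devfs format. Because every record has the same
size, record N sits at offset N * record size and can be read or rewritten
in place without parsing anything else.

```python
import io
import struct

# Hypothetical fixed-size record: uid, gid, mode (32-bit unsigned each),
# then atime and mtime (64-bit signed each), little-endian, no padding.
RECORD = struct.Struct("<IIIqq")

def write_record(f, index, uid, gid, mode, atime, mtime):
    """Overwrite record number `index` in place."""
    f.seek(index * RECORD.size)
    f.write(RECORD.pack(uid, gid, mode, atime, mtime))

def read_record(f, index):
    """Return record number `index` as a tuple, or None if past end of file."""
    f.seek(index * RECORD.size)
    data = f.read(RECORD.size)
    if len(data) < RECORD.size:
        return None
    return RECORD.unpack(data)

# Usage with an in-memory file standing in for the on-disk store.
store = io.BytesIO()
write_record(store, 3, uid=0, gid=5, mode=0o640, atime=100, mtime=200)
```

Updating one device's atime touches a single fixed-size slot, which is why
frequent timestamp updates are cheap with this kind of layout.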

Another reason for using an overall file as opposed to nodes in the root
file system is that I'd like us to support having information about two
different devices with the same name in the system at once. Obviously only
one device should be active at once, and the locators would have to be
structured such that we can tell the devices apart. However what I really
want is that a device with name "foo" showing up once will not destroy the
access/ownership/locator info for a previously-identified device "foo". As
the node name is essentially a primary key in a directory, we can't do this
easily if we use on-disk nodes as part of the data store.
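
A small sketch of the difference, in user-space Python. The class, its
methods, and the locator strings are all hypothetical. Records are keyed
by (name, locator) rather than by name alone, so a second "foo" gets its
own record instead of overwriting the first, while only one "foo" is
active at a time.

```python
class DeviceStore:
    """Hypothetical store keyed by (name, locator) instead of name alone."""

    def __init__(self):
        self.records = {}   # (name, locator) -> attribute dict
        self.active = {}    # name -> locator of the device currently attached

    def attach(self, name, locator, defaults):
        """Activate a device, creating a record only if it was never seen."""
        key = (name, locator)
        if key not in self.records:
            self.records[key] = dict(defaults)
        self.active[name] = locator
        return self.records[key]

    def lookup(self, name):
        """Return the attributes of the currently active device of this name."""
        locator = self.active.get(name)
        if locator is None:
            return None
        return self.records[(name, locator)]
```

With a directory of on-disk nodes the key is effectively just the name, so
the second attach would have to replace the first device's permissions.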

>   | *) it allows more flexible device namings (e.g. to better support wedges)
>
> Huh?

We're tired of how we currently handle device partitioning. We find it
very lacking, and expect it will have difficulty growing in the future. To
address this, we are moving towards "wedges", which are essentially named
partitions off of disks. Jason has a beginning implementation in -current
now. Wedges will really need something like devfs to keep permissions
correct.

Take care,

Bill
