Subject: Re: Partition tables (was: Re: Another changer, another changer
To: David Holland <dholland@cs.toronto.edu>
From: Shyeah right. What am I gonna do with a gun rack? <greywolf@starwolf.starwolf.com>
List: current-users
Date: 10/19/1998 11:04:17
David Holland sez:
/*
 * Note to those following along: this message rambles for a while and
 * actually has something approaching a proposal at the bottom, so you
 * don't necessarily want to skip over it.
 * 
 *  > Idea for the partition table thing:
 *  >  [...]
 *  >
 *  > We could actually have a single device to which you send requests,
 *  > i.e. /dev/diskpart, and you do something like
 *  > 
 *  > struct diskaccess da[1];	/* contains a struct diskpart */
 *  > 
 *  > da->disk="sd0";
 *  > 
 *  > fd=open("/dev/diskpart", O_RDWR);
 *  > ioctl(fd, DKIOCGDTAB, &da);
 *  > 
 *  > and the diskpart driver would automagically handle the routing.
 * 
 * Wait, which problem are you trying to solve? There are at least the
 * following related issues involved here:
 * 
 *   1. getting disklabel handling code out of disk device drivers
 *   2. supporting multiple types of disklabel/partition table
 *   3. probing a disk for disklabel(s)/partition table(s)
 *   4. organizing/numbering the partitions found for presentation
 *      to higher layers of the system
 *   5. mapping major and minor device numbers to disks and partitions
 *      and/or presenting partition names in /dev
 *   6. mapping disk names to individual pieces of hardware
 *      (this is really mostly the other thread though)
 *   7. providing better support for editing disklabels.

8.  Not making the disklabel access bound to a physical partition.

 * I think you're worrying about (7) though, and I think that's not even
 * really a problem - in all of the solutions proposed so far there's
 * either a whole-disk "partition" like current practice, an independent
 * device node for the whole disk, or a special "partition" that holds
 * just the partition table/disklabel.

Exactly wrong.  I'm trying to find a solution that will let us do the
disklabel thing without wasting a partition.
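
To make the routing idea a bit more concrete, here is a rough userland sketch in C of what the diskpart layer might do with a request: split the disk name into a driver name and a unit, then hand the request to that driver. Every name in it (diskpart_route, struct diskdriver, fake_getlabel) is made up for illustration, the handler is a stand-in rather than a real driver entry point, and none of this is existing kernel code.

```c
#include <stdlib.h>
#include <string.h>

/* Everything below is hypothetical; none of these names exist in the tree. */
struct diskpart {
    int unit;       /* which unit answered the request */
    int nparts;     /* number of partitions in its table */
};

struct diskaccess {
    const char *disk;       /* e.g. "sd0" */
    struct diskpart table;  /* filled in on success */
};

struct diskdriver {
    const char *name;                                /* "sd", "wd", ... */
    int (*getlabel)(int unit, struct diskpart *out); /* label reader */
};

/* Stand-in for a real driver's label reader; the 8 is a made-up value. */
static int fake_getlabel(int unit, struct diskpart *out)
{
    out->unit = unit;
    out->nparts = 8;
    return 0;
}

static const struct diskdriver drivers[] = {
    { "sd", fake_getlabel },
    { "wd", fake_getlabel },
};

/*
 * Split "sd0" into driver name and unit number, then route the request
 * to the matching driver.  Returns 0 on success, -1 if the name has no
 * driver part, no unit number, or no matching driver.
 */
static int diskpart_route(struct diskaccess *da)
{
    size_t len = strcspn(da->disk, "0123456789");

    if (len == 0 || da->disk[len] == '\0')
        return -1;

    int unit = (int)strtol(da->disk + len, NULL, 10);

    for (size_t i = 0; i < sizeof drivers / sizeof drivers[0]; i++) {
        if (strlen(drivers[i].name) == len &&
            strncmp(drivers[i].name, da->disk, len) == 0)
            return drivers[i].getlabel(unit, &da->table);
    }
    return -1;
}
```

The point is only that the caller names a disk, not a partition, so no partition has to be set aside just to reach the label.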

 * 
 * The basic problem in (5) is that you have hardware organized something
 * like this:

The basic problem with this whole thing is the mixing of UN*X/DOS
partitioning with which we regrettably must deal.

 *          wd0
 *           |
 *           |-fdisk table
 *              |--partition1                  (#1)
 *              |--partition2                  (#2)
 *              |     |
 *              |     \-fdisk table
 *              |        |--partition1         (#3)
 *              |
 *              |--partition3                  (#4)
 *                    |
 *                    \-bsd disklabel
 *                         |--partition a      (#5)
 *                         |--partition b      (#6)
 *                         |--partition c      (#7)
 *                         |--partition d      (#8)
 *                         |--partition e      (#9)
 * 
 * ...and a similar mess on each of wd1, sd0, sd1, sd2, etc.
 * 
 * How do you name these? There's some precedent for how one should
 * handle the naming of nested fdisk tables, but not much, and it's not
 * clear what you're supposed to do if there's more than one tree of
 * nested fdisk tables. There's not really any precedent at all for
 * naming the stuff listed in the bsd disklabel *in its full context*.
 * And what do you do if you find a Mac disklabel nested in there
 * someplace?
 * 
 * There are two obvious solutions: one is to choose an order of tree
 * descent and number the partitions in that order. The other is to use
 * hierarchical naming, that is, one number at each level, so you'd have
 * something like wd0/1, wd0/2/1, wd0/3/[a-e], etc.

Get rid of the slash and zero fill the disk number?  I don't know,
there's got to be a more elegant way to handle this.  I don't think I'd
be averse to /dev/wd003a as a device name (wd0, partition 3, subpartition a).

There's got to be a limit here, though; doing fdisk partitions three
times over, what's the point?  Twice makes some sense.
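
For anyone who wants to play with the two schemes, here is a rough sketch in C that walks the wd0 tree drawn above and produces both the descent-order number and the slash-separated name for each partition. The struct layout and the walk() function are made up for illustration; this is a toy, not a proposal for how the kernel should store any of it.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical in-memory picture of the wd0 tree drawn above. */
struct node {
    const char *label;              /* name component: "1", "2", "a", ... */
    int nchildren;
    const struct node *children;
};

static const struct node fdisk2_kids[] = {
    { "1", 0, NULL },
};
static const struct node bsd_kids[] = {
    { "a", 0, NULL }, { "b", 0, NULL }, { "c", 0, NULL },
    { "d", 0, NULL }, { "e", 0, NULL },
};
static const struct node wd0_kids[] = {
    { "1", 0, NULL },
    { "2", 1, fdisk2_kids },        /* nested fdisk table */
    { "3", 5, bsd_kids },           /* nested bsd disklabel */
};
static const struct node wd0 = { "wd0", 3, wd0_kids };

/*
 * Pre-order walk.  For every partition (everything except the disk
 * itself) append a line "#<number> <path>" to out.  Returns the next
 * unused number.
 */
static int walk(const struct node *n, const char *prefix, int next,
                char *out, size_t outsz)
{
    char path[64];

    snprintf(path, sizeof path, "%s%s%s",
             prefix, *prefix ? "/" : "", n->label);

    if (*prefix) {                  /* skip the root, i.e. the disk */
        size_t used = strlen(out);
        snprintf(out + used, outsz - used, "#%d %s\n", next++, path);
    }
    for (int i = 0; i < n->nchildren; i++)
        next = walk(&n->children[i], path, next, out, outsz);
    return next;
}
```

Calling walk(&wd0, "", 1, buf, sizeof buf) on an empty buffer gives #1 wd0/1 through #9 wd0/3/e, matching the numbers in the diagram. Add a second entry to fdisk2_kids and everything from #4 on shifts by one, while the slash names stay put.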

 * 
 * The problem with numbering in order is that splitting a partition
 * someplace renames all the ones "after" it. The problem with
 * hierarchical naming is that backwards compatibility for x86 disks with
 * nested bsd disklabels becomes difficult: identifying what partition
 * used to be, say, wd0c, so you can make a symlink, is a pain. And also,
 * hierarchical naming of this sort isn't very compatible with major and
 * minor device numbering.

Well, what if we enforce a depth limit of two down?  (I know, I know,
"dynamic is better than static; don't limit it statically if you don't
have to", etc., but it _is_ a thought.)  That is, unless you can give a
really practical argument for disk paths like wd0/2/3/1/2/2/3/a...

 * 
 * Then when you start looking at multiple disks it becomes an even
 * bigger nuisance. If you have the partition stuff set up as a driver,
 * you don't really want multiple major numbers for it. In fact, what you
 * really want is for the major number to correspond to the partition
 * table type, because that's the proper way to choose which driver to
 * invoke.

This is logical...

 * 
 * Hmm. Maybe the right thing to do is to collect all the partitions of
 * each type together. Then you know how many partitions per table you
 * can have, so you can assign minor numbers in some sensible manner.
 * Then the wd0 drawn above would give you (let's assume fdisk is the
 * fdisk table device (major 101), dk is the bsd disklabel device (major
 * 100)):

I hope you're not suggesting hard-wiring these -- the config file would
get huger than it is (for the i386 port -- the SPARC port is pretty
reasonable)!

[device table deleted]
 * 
 * (*) These devices would return EBUSY on open because they're in use by
 * other instances of disklabel drivers, unless none of the partitions
 * were open. Or maybe not - it depends if you want a minimal foot-guard
 * on your gun or not. :-)
 * 
 * 
 * This is actually starting to look viable. Do people care if you can't
 * tell by looking at the device name which disk it's on? In a sense it's
 * not different from not being able to tell what physical disk sd0 is by
 * looking at the device name - you have to look at the device probe
 * output.

I see the following as criteria for NetBSD:

	- we need to be able to continue referring to a disk partition as
	  {,r}${type}d${unit}${part}; it appears that this is definitely
	  the _preferred_ naming convention by many, including myself.
	  I could get Solaris 2.6 for free for my IPX, and I know where
	  to find it.  I'm choosing to run NetBSD because it is a nice
	  comfortable system for me to use.  Having seen the internals
	  to SVR4, I can honestly say we're doing a much better
	  architecture job than they did.

	- we need to find another way of actually accessing the disklabel
	  in order to avoid wasting a partition for this purpose.  There
	  are many ways which have been suggested, any or none of which
	  may prove feasible.  It has been brought to my attention, for
	  example, that my scheme lacks sufficient protections to prevent
	  or authorize proper access to the disklabels (they're not all
	  removable nor fixed).

	- we need to have a partition table larger than 8 partitions
	  due to the larger disks that are available.  I would opt for
	  16 because exact powers of two seem to fit much better into
	  things which are ostensibly bit-field oriented in the first place.
	  It avoids waste in the numbering space, and it avoids potentially
	  expen$ive computations (straw man?  I don't know, but I think
	  it's less expen$ive to figure a division by sixteen than by
	  N + something - something else...)
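
To show what I mean by the power-of-two point, here is a minimal sketch in C of packing a unit and a partition into one minor number with 16 partitions per disk. The macro and function names are invented for this example; they are not the kernel's own macros, and the bit layout is just an assumption.

```c
#include <stdint.h>

/* Hypothetical layout: low 4 bits are the partition, the rest the unit. */
#define PART_BITS  4
#define MAXPARTS   (1u << PART_BITS)    /* 16 partitions per disk */
#define PART_MASK  (MAXPARTS - 1)

/* Pack unit and partition into a minor number. */
static inline uint32_t mk_minor(uint32_t unit, uint32_t part)
{
    return (unit << PART_BITS) | (part & PART_MASK);
}

/* Unpacking is a shift and a mask, with nothing left over. */
static inline uint32_t minor_unit(uint32_t m) { return m >> PART_BITS; }
static inline uint32_t minor_part(uint32_t m) { return m & PART_MASK; }

/* Partition 0 is 'a', partition 15 is 'p'. */
static inline char part_letter(uint32_t part)
{
    return (char)('a' + (part & PART_MASK));
}
```

With this layout sd2f (unit 2, partition 5) comes out as minor 37, and every value in the minor space maps to exactly one unit/partition pair.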
 * 
 * I suspect in the long run the only correct way to identify a disk
 * volume is by some kind of volume name or serial number actually stored
 * on the disk. That's in a sense a separate issue though.

It can also get kind of convoluted:

	/dev/rsd1594-993Aa
		or even
	/dev/rdsk/1594-993A/0

doesn't seem to make much sense.

[This may get a bit tangential...]

We're trying to at least keep the device (and LUN?) numbering wired to
the physical device, or so I thought.  The only one we really can simply
access is the jumper on the peripheral itself.  If I have disk drives
number 0, 1, and 3 hooked up, I expect them to show up as sd0, sd1, and
sd3, for example.  I have to wire this down, currently, but that is, for
the moment, entirely beside the point.  If I toss in a disk with its
SCSI address set to 2, I expect sd2 to show up.

{For another discussion:  Should there be a way to say

	disk sd$1 at scsibus? flags 0x0

such that as the disk is found, its address becomes its ID?  i.e., if,
by wildcarding disk sd* it encounters units 1, 2, and 5 (never you mind!)
it will automagically attach them as sd1, sd2, and sd5 instead of sd0,
sd1 and sd2?}
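
A toy illustration in C of the difference between the two behaviours, assuming the probe hands back the SCSI targets in the order it finds them. The enum and the name_disks() function are invented for the example; the "sd$1" syntax above is likewise only my suggestion, not something config understands today.

```c
#include <stdio.h>

/* Hypothetical policies for turning a probed target into a unit number. */
enum unit_policy {
    UNIT_SEQUENTIAL,    /* first disk found is sd0, next is sd1, ... */
    UNIT_FROM_TARGET    /* unit number is the SCSI target itself */
};

/* Fill names[i] with "sdN" for each target found, in probe order. */
static void name_disks(enum unit_policy policy, const int *targets, int n,
                       char names[][8])
{
    for (int i = 0; i < n; i++) {
        int unit = (policy == UNIT_FROM_TARGET) ? targets[i] : i;
        snprintf(names[i], 8, "sd%d", unit);
    }
}
```

Feed it targets 1, 2 and 5: the sequential policy yields sd0, sd1, sd2 and the by-target policy yields sd1, sd2, sd5. With more than one scsibus the target alone would not be unique, so this only covers the single-bus case.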

I really don't see an advantage to trying to divine which controller
is which number; I, like anyone else who has installed a second (third)
controller onto a nice machine like a SPARCstation (or worse yet,
a Sun3/1xx or 2xx (VME cage) machine), have encountered the phenomenon
that the onboard ESP is really at sbus0 slot *3*, and since it probes
the SBUS slots in (some) order (I never noticed the labels), the disks
on the new controller suddenly became sd0, sd2, sd3... and like that.

The only thing I ran into, though, was that I had to boot -a from the
correct disk (which was easy enough to figure), rebuild a kernel and
go from there.  If you're running a stable system, you don't need to
do a "make depend" on much else besides param.c and param.h -- the
objects are already there, for the most part.  I think it built, what,
six objects from source and used the rest of what was there.

Yes, this was StunOS, perhaps not the best example, but you get the
idea.  Migrating to /dev/dsk/c0t0d0s0 and /dev/tape/c0t4hn is NOT
going to truly address the issue of migrating disks.

Unless you KNOW your hardware inside and out, which is, in and of itself,
not a bad idea but shouldn't be always necessary, there is NO way
you're going to figure out which controller a device sits on.

And regarding disk controllers, adding a new major number for every
disk or every disk type seems excessive, and adding a new minor number
for every disk seems just plain sloppy.  We need an interface that will
let us talk to the disk label without having to talk to a partition
proper.

Okay, I'm done -- next rambler?

 * 
 * -- 
 *    - David A. Holland             | (please continue to send non-list mail to
 *      dholland@cs.utoronto.ca      | dholland@hcs.harvard.edu. yes, I moved.)
 */





				--*greywolf;
--
"This is obviously a definition of the word 'safe' with which I was
not previously acquainted." -- Arthur Dent, upon discovering that they were
aboard a Vogon starship.