tech-kern: Re: > 1T filesystems, disklabels, etc

Subject: Re: > 1T filesystems, disklabels, etc
To: Frank van der Linden <fvdl@wasabisystems.com>
From: Bill Studenmund <wrstuden@netbsd.org>
List: tech-kern
Date: 12/14/2002 23:47:07
On Wed, 11 Dec 2002, Frank van der Linden wrote:

> Ok, so we need to attack this problem. The areas that need change can
> roughly be split in 3 parts, all relating to using 32-bit fields in
> structures:
>
>
> 	1) Disklabels. They use 32bit values.

Heh, I was talking to Jason about this yesterday.

> 	   Solution:
> 		implement wedges. It would be a waste to go for a
> 	   	disklabel with 64-bit fields, since people agree that disklabels
> 	   	should be phased out anyway.

?? Why should disklabels be phased out? Yes, I agree we want a lot more
flexibility in how we handle things (we shouldn't shoe-horn everything
into one format), but disklabels are fine.

Also, is it that wedges are really great, or that they are better than
what we have now?

> 	   	So let's do it. For the original discussion (to save some
> 	   	web archive searching), see
>
> 	   	http://www.science.uva.nl/~frank/wedge.txt
>
> 	   	(website at my old job, it happened almost 3 years ago now).
> 		I'm not even suggesting that there is a discussion about
> 		whether this method is good or not -- it's generally
> 		agreed that it is. An open issue is booting. Bootblocks

Yes, but it has a big problem in my mind. How do you keep wedge numbers
consistent between boots, especially so you can set permissions on
wedges/partitions? Wire them down in the kernel config?? Also, how do you
keep the wedges associated with a given disk in the same place when you
boot from a different disk?

As I understand the original idea, wedges just get found, like lv's on
AIX's LVM system. The kernel's root is always wedge0, and wedges other
than root get added by a userland daemon. If disks show up in a different
order (like a drive was off-line at boot one time while on-line the
other), their partitions show up as different wedges. Say you add another
partition to a disk, then either the new partition has a wedge number
quite distant from the other wedges on the disk, or wedge numbers move
around. That's all bad.

I do however think that simplifying how the kernel figures out what
partitioning scheme is on a disk and which partitions follow from it, using
a userland daemon to cover all the bases (so, say, reading an Apple
partition map is possible on all machines, but the kernel on most systems
doesn't need to understand it), and having the kernel no longer write
disklabels/partition maps are ALL GOOD THINGS.

I think we can get what everyone wants with something more of an
intermediate step. Here's my proposal:

In a nutshell, we use a new major for disks and partitions (I'll call it
dk for now), and we support 64 partitions per device. We fill in the 63
non-whole-disk partitions in a wedge-esque manner. So we create per-disk
spaces to fill the wedges into.
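
To make the numbering concrete, here is a minimal sketch in C of how a
minor number could encode the disk unit and the partition slot under the
64-per-device idea. The names (DK_MAXPART, dk_minor, dk_unit, dk_part) are
made up for illustration, and putting the whole disk in slot 0 is my
assumption; none of this is existing kernel code.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 64 slots per disk, with slot 0 standing for the whole disk. */
#define DK_MAXPART 64u

/* Build a minor number from a disk unit and a partition slot. */
static inline uint32_t
dk_minor(uint32_t unit, uint32_t part)
{
	assert(part < DK_MAXPART);
	return unit * DK_MAXPART + part;
}

/* Recover the disk unit from a minor number. */
static inline uint32_t
dk_unit(uint32_t m)
{
	return m / DK_MAXPART;
}

/* Recover the partition slot (0 = whole disk) from a minor number. */
static inline uint32_t
dk_part(uint32_t m)
{
	return m % DK_MAXPART;
}
```

With this layout every disk owns a fixed range of 64 minors, so adding a
partition to one disk never shifts the numbers used by another disk.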

So here are more concrete ideas:

1) First off, there will be a mapping method so we can go from
wd/sd/current disk partitions (major numbers) to whatever the new mapping
is. So we will have the device node legacy support that was such a big
deal when we went to 32-bit dev_t.

2) The partitions/wedges on a disk are named based on the device and then
which partition/wedge in the device it is. So we could have dk0a, dk0g,
dk1b, etc. dk0 (no partition letter) would be the whole disk, so the wd0c
vs wd0d mess goes away.

3) We fill in the list of wedges/partitions for a disk by calling a number
of partition-finding routines. Each routine knows how to understand a
different disklabeling scheme, and it fills in a number of partitions it
finds. Obviously if its scheme isn't present, it doesn't add any. The
routines would be designed to, as much as is possible, build a list of
partitions/wedges that looks like what machines using that method now get.
So for the most part, operationally, users just need to know to start
using dk#X instead of wd#X or sd#X; we get the expandability and
flexibility we want w/o having to totally retrain everyone.

So let me give an example. I can think of five example disklabeling scheme
routines: ApplePartMap, amigaPart, NetBSD disklabel-be and -le, Sun
disklabel (I gather they are different subtly since you can disklabel a
sun box in a way the bootloader won't like), and (last but FAR from least)
mbr. There are others too, this list is just meant to be an example.

What they would do:

a) NetBSD disklabel -be and -le. They would be the same core code, just
one would use big-endian access, and the other little. Oh, actually there
would be four. For each byte order, there would be one that looks in
sector 0 of the "disk", and one that looks in sector 1 (reflecting
LABELSECTOR now).

If we found a NetBSD disklabel of the given byte order in the mentioned
sector, we would grab sixteen partitions out of the disklabel pool, and
fill them in from the disklabel structure we found. We would grab a fixed
number of partitions/wedges so that we can add more later w/o renumbering
subsequent ones.

b) ApplePartMap. This code would grab either 16 or 32 partitions/wedges
(not sure which number would be better), look at the Apple partitions
present, and list the ones we find in an order similar to what macppc and
mac68k do now.

c) amigaPart. Grabs like 16 partitions (I think that's an ok #) and fills
them in like now.

d) sun disklabel (if it is different at this level), grabs 8 partitions
and reads from the sun disklabel.

e) mbr. I saved this one for last since it'll be the most complex.

The mbr code will take the wedge scan-and-fill-in-what-you-find idea one
level farther. It will look for specific mbr types, and if present will
fill in partitions/wedges as appropriate.

So the first thing it will do is look for a NetBSD mbr type. If it finds
one, it will call the NetBSD-le-LABELSECTOR=1 code, and read in the NetBSD
disklabel. This would take 16 partitions/wedges.

Next it looks for an Old NetBSD mbr partition, which is also a FreeBSD
partition. It grabs either 8 or 16 partitions (we only used 8, did FreeBSD
use 16?), and fills it in as per NetBSD-le-LABELSECTOR=1.

Next, if we want, it looks for an OpenBSD mbr partition, and fills it in
as above.

Then it grabs one partition for each mbr partition not used above, and
fills it in. This will give us Windows and Linux partitions.

If it finds an extended partition, after adding it above, it iterates
over the contents, and adds partitions as appropriate.

The idea of the above is that what most users will notice is that things
are like they were before, only better. All the NetBSD stuff is in the
same place, but (say on mbr) other-OS partitions just show up.
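
The fixed-size grabs described above can be sketched roughly as follows.
This is an illustration only: struct dk_table, dk_reserve and dk_fill are
hypothetical names, and a real version would also have to record which
scheme owns each block and bounds-check the index against the block size.

```c
#include <assert.h>
#include <stdint.h>

#define DK_NSLOTS 63		/* non-whole-disk slots per disk */

struct dk_slot {
	uint64_t start;		/* first sector; 64-bit so > 1T works */
	uint64_t size;		/* length in sectors; 0 means unused */
};

struct dk_table {
	struct dk_slot slot[DK_NSLOTS];
	int next_free;		/* first slot not yet reserved */
};

/*
 * Reserve a contiguous block of n slots and return its base index,
 * or -1 if the block does not fit in what is left of the table.
 */
static int
dk_reserve(struct dk_table *t, int n)
{
	int base = t->next_free;

	assert(n > 0);
	if (base + n > DK_NSLOTS)
		return -1;
	t->next_free = base + n;
	return base;
}

/* Record partition idx inside a block previously handed out at base. */
static void
dk_fill(struct dk_table *t, int base, int idx, uint64_t start, uint64_t size)
{
	assert(base >= 0 && idx >= 0 && base + idx < DK_NSLOTS);
	t->slot[base + idx].start = start;
	t->slot[base + idx].size = size;
}
```

A NetBSD disklabel routine would call dk_reserve(t, 16) once and fill in
only the entries it finds; the unused slots in its block stay empty, which
is what lets a partition be added later w/o renumbering whatever the mbr
code reserved afterwards.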

4) Each port would have a default search order through the different
partitioning schemes, and this would be obtainable as a sysctl (so we
don't have to compile the list into say libc, and so we can share binaries
within an arch).

The list would be crafted to reflect the current disklabel search scheme
each port has now. For instance, MacPPC would look for NetBSD-be-LABEL==0,
then ApplePartMap, then MBR. i386 would look for MBR, then look for
NetBSD-le-LABEL==1. The idea with this list is to generate disklabels that
look like they do now (at least for the first partitions/wedges; chances
are this method will find more than our current code).

In addition to the above list, there would be a default NetBSD search
order list, which we consult after the port-specific one. It would provide
one place where you would add new partitioning schemes, rather than into
each port's list. Obviously if you add partitioning scheme X for port X,
you'd add it to both the NetBSD list and to port X's port-specific list.

Duplicates between the port-specific list and the NetBSD general list
aren't searched for twice.
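
As a rough sketch of how the two lists could be combined (assuming schemes
are identified by a small enum; the identifiers and function names below
are invented for the example, not taken from any existing code):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scheme identifiers; the names are illustrative only. */
enum dk_scheme {
	DKS_NETBSD_BE_0, DKS_NETBSD_BE_1, DKS_NETBSD_LE_0, DKS_NETBSD_LE_1,
	DKS_APPLE, DKS_AMIGA, DKS_SUN, DKS_MBR
};

/*
 * Append the entries of src to out, skipping any already present.
 * Returns the new element count; out must have room for all of them.
 */
static size_t
dk_append_unique(enum dk_scheme *out, size_t n,
    const enum dk_scheme *src, size_t nsrc)
{
	for (size_t i = 0; i < nsrc; i++) {
		size_t j;

		for (j = 0; j < n; j++)
			if (out[j] == src[i])
				break;
		if (j == n)
			out[n++] = src[i];
	}
	return n;
}

/* Port-specific order first, then the general list, with no duplicates. */
static size_t
dk_search_order(enum dk_scheme *out,
    const enum dk_scheme *port, size_t nport,
    const enum dk_scheme *general, size_t ngeneral)
{
	size_t n;

	assert(out != NULL);
	n = dk_append_unique(out, 0, port, nport);
	return dk_append_unique(out, n, general, ngeneral);
}
```

For i386, a port list of { DKS_MBR, DKS_NETBSD_LE_1 } merged with a general
list that also contains those two would still try MBR first and NetBSD-le
second, and each of them only once.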

5) Oh, just to make sure it's clear, we iterate over ALL of the
partitioning schemes. So for instance, a NetBSD box will find both an
Apple Partition Map and an mbr labeling, if they both are present (which
was true for some Mac&PC Zip disks).

6) There would be a userland daemon that could scan _all_ the partitioning
method possibilities on a disk and fill in the wedge/partition list for a
disk, when requested by the kernel.

7) The kernel would know about a few of the schemes to facilitate booting.
A kernel config could pull in partitioning scheme methods, just like
we pull in file systems. The idea is the code would be sufficiently
abstracted that the same disklabel/partitioning scheme grokking code could
be in the kernel and in the userland lib; we wouldn't have two versions of
the code floating around.

8) If you wanted to, I think it'd be fine for a kernel config to be able
to override the port-specific list. So if you really wanted a different
search order, you could easily get it. And since userland gets the list
from the kernel, it would automatically adapt.



Getting back to wedges:

So I think that while the wedges idea has some good points, and this code
NEEDS redesign, the wedges idea I've heard has two glaring holes (or at
least undetailed areas):

The first problem is one of naming. Unix is used to referring to devices by
name. If wedge numbers aren't wired down, what, say, would /etc/fstab look
like?

If we were designing everything from scratch (or making something like
AIX/OSF's LVM; something where we can reasonably say forget backwards
compat), we could add locators in the disklabeling so we could assign a
persistent name tied to a partition, so we wouldn't have this problem. But
we aren't, so I don't see how we can do that.

We could wire down wedges, but 1) I don't really see how, and 2) that
would be a pain. Folks like lots of partitions (see the upping of the
number of i386 partitions), so having to wire down wedges would add #
partitions/wedges * # disks lines to a config file. Plus, adding a
partition would mean, to keep it in a wired place, that you have to
recompile your kernel. Folks are used to wiring down disks; let's not make
it worse.

The other problem I was going to talk about is how do you persistently
keep the permissions on a wedge/partition. But if we solve the above name
problem and we have real inodes in /dev, this problem's solved.

If you can come up with a way to make wedges address my concerns, I'll
hush. But I've thought about it a lot (we've been talking about this for
years :-) and I've not come up with anything.

Also, in terms of work needed, both this idea and wedges need code to
understand all our partitioning schemes. Wedges however also needs code to
address the naming concern above, which is more work. Both schemes need a
way to backwards-support old block inodes, which I think will be more work
for wedges as wedges is a different paradigm.

Thoughts?

> 		must find the filesystem and load the kernel, and the
> 		kernel must find it's root filesystem. This may be
> 		platform dependent. I want a list of platforms and how
> 		we address this issue for this platform.
>
> Now as usual in a volunteer project, the issue is: who is going
> to do this. I know that my plate is pretty full, but I can
> coordinate this (and be a responsible Core member for a change..).
>
> I suggest following up on the above issues, make a list of things
> to do, and start a CVS branch.

Well, thems are my thoughts. I might be able to lend a minor hand,
especially w/ the partitioning stuff.

Take care,

Bill