tech-kern: Re: Work-in-progress "wedges" implementation

Subject: Re: Work-in-progress "wedges" implementation
To: Jason Thorpe <thorpej@shagadelic.org>
From: Bill Studenmund <wrstuden@netbsd.org>
List: tech-kern
Date: 09/22/2004 15:11:04

On Wed, Sep 22, 2004 at 01:26:34PM -0700, Jason Thorpe wrote:
> Wedges are a new way of representing disk partitions in the NetBSD
> kernel.

Cool.

> The basic idea is to decouple the internal representation of disk
> partitions from the on-disk representation.  Currently, the NetBSD
> kernel uses "struct disklabel" (a.k.a. BSD disklabel) for both in-core
> and on-disk representation, and operates on this structure exclusively.
>
> The main problem is that some platforms use (by necessity) on-disk
> representations other than the BSD disklabel.  This is generally to
> maintain compatibility with another OS on the platform (e.g. Mac OS on
> a Macintosh), or because the system firmware understands a particular
> format (e.g. Sun PROMs understand Sun disklabels).
>
> In order to handle this "other format", individual platforms may support
> an alternative on-disk representation.  In the kernel, this is
> represented by "struct cpu_disklabel".  Unfortunately, there are
> drawbacks to this approach:
>
>         - Cross-platform disk portability is basically non-existent.
>
>         - The BSD disklabel cannot represent all of the pertinent
>           information of some other on-disk representations, and
>           vice-versa.  This includes number of partitions and
>           partition names.

There are also issues like how i386 & friends have a layered partitioning
scheme where our stuff is in an mbr partition.

> Another problem is the fact that the BSD disklabel uses 32-bit fields
> for block numbers.  This means that the largest disk that the BSD
> disklabel can describe is 2TB, which is not terribly large by today's
> standards.
>
> Finally, in a world with hot-plug busses where devices may appear and
> disappear at any time, deterministic disk probe ordering does not exist.
> The old-fashioned disk naming scheme is not very usable in this
> scenario.
>
> Wedges solves these problems in the following ways:
>
>         - Disk partitions are represented in the kernel as separate
>           block devices, and there can be an arbitrary number of these
>           associated with a disk.  Each wedge internally uses 64-bit
>           block numbers to support partitions > 2TB.

Cool.
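
For reference, the 2TB figure falls out of simple arithmetic. A minimal
sketch, assuming 512-byte sectors (the sector size is an assumption, not
something stated above; max_addressable_bytes() is a made-up helper):

```c
#include <stdint.h>

/*
 * Largest byte size addressable with a block-number field of the given
 * width.  sector_size is an assumption; 512 bytes is the common case.
 * bits must be less than 64.
 */
static uint64_t
max_addressable_bytes(unsigned bits, uint64_t sector_size)
{
	/* Number of distinct block numbers: 2^bits. */
	uint64_t nblocks = (uint64_t)1 << bits;

	return nblocks * sector_size;
}
```

With bits = 32 and sector_size = 512 this gives 2^41 bytes, i.e. 2TB.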

>         - Wedges includes a modular partition discovery framework,
>           allowing different partition formats to be supported
>           seamlessly on all platforms.  A module for the EFI GUID
>           Partition Table (GPT) format, which includes arbitrary
>           numbers of partitions, 64-bit block numbers, and Unicode
>           partition names, is included.

Cool. I'll check this out.

>         - Wedges may also be configured using ioctls from user space,
>           allowing partition handling to be pushed out of the kernel,
>           if desired.
>
>         - Wedges are "named".  That is, each wedge has an associated
>           name encoded in UTF-8.  This name can be used to create a
>           device node in /dev to decouple the wedge's identity from
>           its probe-order-dependent unit number.  Duplicate names are
>           suppressed, and partition discovery modules can try alternate
>           names in the event of a collision.  For example, the GPT
>           module may try the Unicode name associated with the GPT
>           partition, and if that already exists, it may try again
>           using the string representation of the partition's GUID.
>
>         - Wedges represent partition types as strings, allowing for
>           arbitrary partition types.
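
The name-collision fallback described above could look roughly like the
following. This is only a sketch: struct name_table, name_in_use() and
claim_first_free() are hypothetical names for illustration, not anything
from the patch, and the fixed-size table stands in for whatever the
kernel really uses.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_WEDGES	8
#define WEDGE_NAMELEN	128

/* Hypothetical in-memory table of wedge names already in use. */
struct name_table {
	char	names[MAX_WEDGES][WEDGE_NAMELEN];
	size_t	count;
};

static bool
name_in_use(const struct name_table *t, const char *name)
{
	for (size_t i = 0; i < t->count; i++)
		if (strcmp(t->names[i], name) == 0)
			return true;
	return false;
}

/*
 * Try each candidate in order and register the first one that is free.
 * Returns the stored name, or NULL if every candidate collides (or is
 * too long) or the table is full.
 */
static const char *
claim_first_free(struct name_table *t, const char *const *cands,
    size_t ncands)
{
	if (t->count >= MAX_WEDGES)
		return NULL;
	for (size_t i = 0; i < ncands; i++) {
		if (strlen(cands[i]) >= WEDGE_NAMELEN)
			continue;	/* too long to store; skip it */
		if (!name_in_use(t, cands[i])) {
			strcpy(t->names[t->count], cands[i]);
			return t->names[t->count++];
		}
	}
	return NULL;
}
```

A GPT-style caller would pass the partition's Unicode name first and the
GUID string second, so the GUID is only used when the friendly name is
already taken.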

Do we have support for partition locators? Like "the 6th partition in the
BSD disklabel in the 2nd mbr partition"? (obviously I'll look :-) With
them, we could make a userland database that knows where each partition
has been seen, and so we can keep user access permissions (and ACLs)
constant in the face of repartitioning. Since GPT is a flat partitioning
scheme, it won't need them as much. I also understand this is a WIP, and
this can come later.

Oh, still looking. But the locators would be important only for whatever
wants to "add" wedges. So most of the code (all the diff I've looked at so
far) won't care. And only code that keeps things the same across boots
will truly care, so the kernel level may not care, other than carrying
around another ASCII string for the wedge.
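
To make the locator idea concrete, here is one way such an ASCII string
could be built. Everything in it is an assumption for illustration: the
"scheme:index/scheme:index" syntax, struct locator_step and
format_locator() are invented here and are not part of the wedges patch.

```c
#include <stddef.h>
#include <stdio.h>

/* One level of nesting: a partitioning scheme and a 1-based slot. */
struct locator_step {
	const char	*scheme;	/* e.g. "mbr" or "disklabel" */
	unsigned	 index;
};

/*
 * Write the locator for the given path into buf.
 * Returns 0 on success, -1 if buf is too small.
 */
static int
format_locator(char *buf, size_t buflen, const struct locator_step *steps,
    size_t nsteps)
{
	size_t used = 0;

	if (buflen == 0)
		return -1;
	buf[0] = '\0';
	for (size_t i = 0; i < nsteps; i++) {
		int n = snprintf(buf + used, buflen - used, "%s%s:%u",
		    i ? "/" : "", steps[i].scheme, steps[i].index);
		if (n < 0 || (size_t)n >= buflen - used)
			return -1;	/* output was truncated */
		used += (size_t)n;
	}
	return 0;
}
```

For "the 6th partition in the BSD disklabel in the 2nd mbr partition",
the steps { "mbr", 2 } and { "disklabel", 6 } yield "mbr:2/disklabel:6".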

> The wedges implementation is a work-in-progress at the moment, designed
> to allow for the use of old-style disk naming while wedges are still
> under development.  Features of the current wedges implementation:
>
>         1. More items are moved from individual disk softc structures
>            into "struct disk".  Among other things, this allows for
>            information sharing and better synchronization between
>            wedges and their parent disks.
>
>         2. I/O is enqueued on the wedge and a new buf allocated in order
>            to perform I/O on the parent.  This is a transitional
>            measure; I would like to eventually make it possible for
>            disk drivers to operate directly on the buf provided to the
>            wedge.
>
>         3. Once wedges are created on a disk, I/O to that disk may only
>            be performed through its wedges, or on the disk's RAW_PART.
>            Wedges may not be created on a disk if any partition other
>            than RAW_PART is open.

In the long run, I think we'll need to do something different here. I
liked the original wedges idea, where you have just enough wedge info in
the kernel to boot, then you let userland find all the wedges. The
implication of this is that we would want to add wedges to the boot disk
after / has been mounted. But that can be fixed later, when we add some
sort of code to check for overlaps/errors.

>         4. A minphys entry point is added to "struct dkdriver".
>            Eventually, I would like to fully utilize "struct dkdriver"
>            as the interface to a disk from a wedge, rather than using a
>            vnode.  Once we are fully transitioned to wedges, I would
>            like to see the traditional entry points to disk drivers go
>            away, with the exception of an entry point for the raw disk,
>            so that partitions may be created on it.
>
>         5. My patch includes modifications to make wedges work with the
>            "wd" driver.  I will convert the other disk drivers over
>            time.  An outstanding question: What should we do about
>            floppy drives?

I'd say leave them alone. We don't support partitioning them; the letters
are just the formatting density.

>         6. I have modified fsck and mount to use the partition type
>            names that wedges provide.  Conveniently, I have defined
>            names that match the fsck_* and mount_* names for the
>            various partition types that indicate file systems.
>
> Known issues:
>
>         1. You can't currently newfs a wedge.  This is because newfs
>            requires the old-style DIOCGDLABEL ioctl, which wedges do
>            not support.  I am working on a means for exporting the
>            parent disk's geometry through the wedge, which is what
>            newfs wants.
>
>         2. Related to (1), what to do about the block size / frag size
>            entries in "struct partition" (part of "struct disklabel",
>            and thus antiquated and obsolete and not part of wedges)?

I'd say keep them as much as we can. Obviously if the on-disk labeling
won't support the extra info, we can't keep it. But to the extent we can
keep this stuff, I'd like us to. I know that I've saved my butt on
occasion (ok, twice) from having this info in the disklabel. Yes, it's not
a cure-all; if the disk dies or the part map gets scribbled all over, we
lose. But it helps.

I think we need some way to extract the geometry plus some way of
describing the partition itself. For the latter, we talked a bit and I was
thinking something like:

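struct wedgeinfo {
	uint64_t	wi_offset;	/* start of the wedge on the parent disk */
	uint64_t	wi_size;	/* size of the wedge */
	union {
		uint32_t fsize;		/* file system basic fragment size */
		uint32_t cdsession;	/* CD-ROM session offset */
	} _wi_u4;
#define wi_fsize	_wi_u4.fsize
#define wi_cdsession	_wi_u4.cdsession
	uint8_t 	wi_fstype;	/* file system type */
	uint8_t 	wi_frag;	/* file system fragments per block */
	union {
		uint16_t cpg;		/* UFS: cylinders per group */
		uint16_t sgs;		/* LFS: segment shift */
	} _wi_u2;
#define wi_cpg		_wi_u2.cpg
#define wi_sgs		_wi_u2.sgs
};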

Yes, that's binary-compat with the partition-extra stuff we have in struct
disklabel now.
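
The binary-compat claim can be checked mechanically. The sketch below
compares the tail of struct wedgeinfo against a local stand-in for the
extra fields of "struct partition". The stand-in (struct part_extra) is
an assumption modeled on the field order and widths in <sys/disklabel.h>,
not the real header, and WI_REL() and wedgeinfo_extra_matches() are
made-up helpers.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Stand-in for the per-partition "extra" fields at the end of
 * "struct partition".  Assumption: same order and widths as the
 * real structure; this is not copied from <sys/disklabel.h>.
 */
struct part_extra {
	uint32_t	fsize;
	uint8_t		fstype;
	uint8_t		frag;
	uint16_t	cpg;
};

struct wedgeinfo {
	uint64_t	wi_offset;
	uint64_t	wi_size;
	union {
		uint32_t fsize;
		uint32_t cdsession;
	} _wi_u4;
	uint8_t		wi_fstype;
	uint8_t		wi_frag;
	union {
		uint16_t cpg;
		uint16_t sgs;
	} _wi_u2;
};

/* Offset of a wedgeinfo member relative to the start of the extras. */
#define WI_REL(member) \
	(offsetof(struct wedgeinfo, member) - \
	 offsetof(struct wedgeinfo, _wi_u4))

/* Returns 1 if the tail of wedgeinfo lines up with part_extra, else 0. */
static int
wedgeinfo_extra_matches(void)
{
	return WI_REL(_wi_u4) == offsetof(struct part_extra, fsize) &&
	    WI_REL(wi_fstype) == offsetof(struct part_extra, fstype) &&
	    WI_REL(wi_frag) == offsetof(struct part_extra, frag) &&
	    WI_REL(_wi_u2) == offsetof(struct part_extra, cpg) &&
	    sizeof(struct wedgeinfo) - offsetof(struct wedgeinfo, _wi_u4) ==
	    sizeof(struct part_extra);
}
```

Only the fields after wi_size are compared; the 64-bit offset and size
are deliberately wider than their 32-bit disklabel counterparts.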


So back to the newfs question, I think newfs wants an ioctl to read the
disk info and the part info. They could both be loaded into some uber
wrapper struct. Then it wants a function to "write" the part info.
Obviously the size and offset wouldn't actually be writable.
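
Roughly, the read/write split could be sketched as below. All of the
names here (diskinfo_sketch, partinfo_sketch, wedge_query_sketch,
partinfo_write) are hypothetical and exist only to illustrate the idea;
no such ioctl or structure is defined in the patch, and returning EINVAL
for a size/offset change is an assumed policy.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical read-only disk information (names are illustrative). */
struct diskinfo_sketch {
	uint32_t	di_secsize;	/* bytes per sector */
	uint64_t	di_nsectors;	/* total sectors on the disk */
};

/* Hypothetical per-partition information. */
struct partinfo_sketch {
	uint64_t	pi_offset;	/* read-only */
	uint64_t	pi_size;	/* read-only */
	uint32_t	pi_fsize;
	uint8_t		pi_fstype;
	uint8_t		pi_frag;
};

/* The "uber wrapper": what a single read request might return. */
struct wedge_query_sketch {
	struct diskinfo_sketch	wq_disk;
	struct partinfo_sketch	wq_part;
};

/*
 * Apply a requested update to the current partition info.
 * Size and offset are not writable, so any attempt to change
 * them is rejected and cur is left untouched.
 */
static int
partinfo_write(struct partinfo_sketch *cur,
    const struct partinfo_sketch *req)
{
	if (req->pi_offset != cur->pi_offset ||
	    req->pi_size != cur->pi_size)
		return EINVAL;
	cur->pi_fsize = req->pi_fsize;
	cur->pi_fstype = req->pi_fstype;
	cur->pi_frag = req->pi_frag;
	return 0;
}
```

The point of checking before copying is that a caller such as newfs can
read the whole struct, tweak the file system fields, and hand it back
unchanged otherwise.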

I'm glad you're working on this, as I'm actually working on making it so
that we support changing fs type on an Apple PartMap. The "ioctl() to
write struct partition" trick is what I was going to use there, and then
add code so that the kernel could write the change back. I want to move
away from all of the "write disklabel" stuff for the reasons that motivate
wedges above. So this change is all in the right direction. :-)

> I would like to get "wedges" checked into the tree to allow for greater
> collaboration on it.  Since it does not interfere with the use of disks
> through the traditional interface, I don't think it's necessary to put
> this on a branch.

Sounds like an excellent idea!

Take care,

Bill
