Subject: Re: Work-in-progress "wedges" implementation
To: Jason Thorpe <thorpej@shagadelic.org>
From: Bill Studenmund <wrstuden@netbsd.org>
List: tech-kern
Date: 09/22/2004 15:11:04
On Wed, Sep 22, 2004 at 01:26:34PM -0700, Jason Thorpe wrote:
> Wedges are a new way of representing disk partitions in the NetBSD
> kernel.
Cool.
> The basic idea is to decouple the internal representation of disk
> partitions from the on-disk representation. Currently, the NetBSD kernel
> uses "struct disklabel" (a.k.a. BSD disklabel) for both in-core and
> on-disk representation, and operates on this structure exclusively.
>
> The main problem is that some platforms use (by necessity) on-disk
> representations other than the BSD disklabel. This is generally to
> maintain compatibility with another OS on the platform (e.g. Mac OS on
> a Macintosh), or because the system firmware understands a particular
> format (e.g. Sun PROMs understand Sun disklabels).
>
> In order to handle this "other format", individual platforms may support
> an alternative on-disk representation. In the kernel, this is represented
> by "struct cpu_disklabel". Unfortunately, there are drawbacks to this
> approach:
>
> - Cross-platform disk portability is basically non-existent.
>
> - The BSD disklabel cannot represent all of the pertinent
>   information of some other on-disk representations, and
>   vice-versa. This includes number of partitions and
>   partition names.
There are also issues like how i386 & friends have a layered partitioning
scheme where our stuff is in an mbr partition.
> Another problem is the fact that the BSD disklabel uses 32-bit fields
> for block numbers. This means that the largest disk that the BSD
> disklabel can describe is 2TB, which is not terribly large by today's
> standards.
>
> Finally, in a world with hot-plug busses where devices may appear and
> disappear at any time, deterministic disk probe ordering does not exist.
> The old-fashioned disk naming scheme is not very usable in this scenario.
>
> Wedges solves these problems in the following ways:
>
> - Disk partitions are represented in the kernel as separate
>   block devices, and there can be an arbitrary number of these
>   associated with a disk. Each wedge internally uses 64-bit
>   block numbers to support partitions > 2TB.
Cool.
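For reference, the 2TB figure quoted above falls out of simple arithmetic: 2^32 block numbers times the sector size. The sketch below assumes the usual 512-byte sector (the mail above does not state a sector size), and the helper name is made up for illustration, not a NetBSD interface.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a NetBSD API): number of bytes addressable
 * when block numbers are `bits` wide and each block is `secsize` bytes.
 * Only meaningful for bits < 64 and results that fit in 64 bits.
 */
static uint64_t
addressable_bytes(unsigned bits, uint32_t secsize)
{
	return ((uint64_t)1 << bits) * secsize;
}
```

With 32-bit block numbers and 512-byte sectors this gives 2^41 bytes, i.e. 2 TiB, which is the ceiling the BSD disklabel runs into.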
> - Wedges includes a modular partition discovery framework, allowing
>   different partition formats to be supported seamlessly on all
>   platforms. A module for the EFI GUID Partition Table (GPT)
>   format, which includes arbitrary numbers of partitions, 64-bit
>   block numbers, and Unicode partition names, is included.
Cool. I'll check this out.
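To make the "modular discovery" idea above concrete, here is a rough user-space sketch of a table of probe callbacks tried in order. The struct, the function names, and the two-sector probe interface are all hypothetical and are not taken from the patch; only the on-disk signatures themselves ("EFI PART" at the start of the GPT header in sector 1, and 0x55 0xAA at the end of an MBR sector) are well-known facts. A 512-byte sector is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SECSIZE 512	/* assumed sector size */

/* Hypothetical discovery method: a name plus a probe callback. */
struct discovery_method {
	const char *dm_name;
	/* Returns nonzero if the first two sectors look like this format. */
	int (*dm_probe)(const uint8_t *sec0, const uint8_t *sec1);
};

/* GPT: the header in sector 1 starts with the 8-byte signature "EFI PART". */
static int
gpt_probe(const uint8_t *sec0, const uint8_t *sec1)
{
	(void)sec0;
	return memcmp(sec1, "EFI PART", 8) == 0;
}

/* MBR: sector 0 ends with the boot signature bytes 0x55 0xAA. */
static int
mbr_probe(const uint8_t *sec0, const uint8_t *sec1)
{
	(void)sec1;
	return sec0[510] == 0x55 && sec0[511] == 0xAA;
}

/* Ordered by preference: GPT disks also carry a protective MBR. */
static const struct discovery_method methods[] = {
	{ "gpt", gpt_probe },
	{ "mbr", mbr_probe },
};

/* Return the name of the first method whose probe matches, or NULL. */
static const char *
discover(const uint8_t *sec0, const uint8_t *sec1)
{
	size_t i;

	for (i = 0; i < sizeof(methods) / sizeof(methods[0]); i++)
		if (methods[i].dm_probe(sec0, sec1))
			return methods[i].dm_name;
	return NULL;
}
```

The point of the table is that supporting another format on every platform only means adding one more entry, rather than touching per-port disklabel code.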
> - Wedges may also be configured using ioctls from user space,
>   allowing partition handling to be pushed out of the kernel,
>   if desired.
>
> - Wedges are "named". That is, each wedge has an associated
>   name encoded in UTF-8. This name can be used to create a
>   device node in /dev to decouple the wedge's identity from
>   its probe-order-dependent unit number. Duplicate names are
>   suppressed, and partition discovery modules can try alternate
>   names in the event of a collision. For example, the GPT module
>   may try the Unicode name associated with the GPT partition, and
>   if that already exists, it may try again using the string
>   representation of the partition's GUID.
>
> - Wedges represent partition types as strings, allowing for
>   arbitrary partition types.
Do we have support for partition locators? Like "the 6th partition in the
BSD disklabel in the 2nd mbr partition"? (obviously I'll look :-) With
them, we could make a userland database that knows where each partition
has been seen, and so we can keep user access permissions (and ACLs)
constant in the face of repartitioning. Since GPT is a flat partitioning
scheme, it won't need them as much. I also understand this is a WIP, and
this can come later.

Oh, still looking. But the locators would be important only for whatever
wants to "add" wedges. So most of the code (all the diff I've looked at so
far) won't care. And only code that keeps things the same across boots
will truly care, so the kernel level may not care, other than carrying
around another ASCII string for the wedge.
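As an illustration of what such a locator string could look like, here is a minimal sketch that renders a chain of (scheme, index) steps as plain ASCII. The "scheme:index/scheme:index" syntax, the struct, and the function name are all invented for this example; nothing here is part of the wedges patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* One step of a hypothetical locator: partition `index` within `scheme`. */
struct locator_step {
	const char *ls_scheme;	/* e.g. "mbr", "disklabel" */
	unsigned ls_index;	/* 1-based partition number */
};

/*
 * Format steps as "scheme:index/scheme:index/...".
 * Returns 0 on success, -1 if the buffer is too small.
 */
static int
locator_format(char *buf, size_t len, const struct locator_step *steps,
    size_t nsteps)
{
	size_t off = 0, i;
	int n;

	if (len == 0)
		return -1;
	buf[0] = '\0';
	for (i = 0; i < nsteps; i++) {
		n = snprintf(buf + off, len - off, "%s%s:%u",
		    i == 0 ? "" : "/", steps[i].ls_scheme, steps[i].ls_index);
		if (n < 0 || (size_t)n >= len - off)
			return -1;
		off += (size_t)n;
	}
	return 0;
}
```

Under this made-up syntax, "the 6th partition in the BSD disklabel in the 2nd mbr partition" would come out as "mbr:2/disklabel:6", which a userland database could use as a stable key.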
> The wedges implementation is a work-in-progress at the moment, designed
> to allow for the use of old-style disk naming while wedges are still
> under development. Features of the current wedges implementation:
>
> 1. More items are moved from individual disk softc structures
>    into "struct disk". Among other things, this allows for
>    information sharing and better synchronization between
>    wedges and their parent disks.
>
> 2. I/O is enqueued on the wedge and a new buf allocated in order
>    to perform I/O on the parent. This is a transitional measure;
>    I would like to eventually make it possible for disk drivers
>    to operate directly on the buf provided to the wedge.
>
> 3. Once wedges are created on a disk, I/O to that disk may only
>    be performed through its wedges, or on the disk's RAW_PART.
>    Wedges may not be created on a disk if any partition other
>    than RAW_PART is open.
In the long run, I think we'll need to do something different here. I
liked the original wedges idea where you have just enough wedge info in
the kernel to boot, and then let userland find all the wedges. The
implication of this is that we would want to add wedges to the boot disk
after / has been mounted. But that can be fixed later, when we add some
sort of code to check for overlaps/errors.
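The overlap check mentioned above could be as small as the following sketch. It treats each wedge as a half-open block range on the parent disk and guards against 64-bit wraparound separately. The struct and function names are hypothetical, not taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical extent of a wedge on its parent disk, in blocks. */
struct wedge_extent {
	uint64_t we_offset;
	uint64_t we_size;
};

/* Nonzero if offset + size would wrap around 64 bits. */
static int
extent_wraps(const struct wedge_extent *e)
{
	return e->we_size > UINT64_MAX - e->we_offset;
}

/*
 * Nonzero if the half-open ranges [offset, offset + size) intersect.
 * Empty extents never overlap anything. Callers are expected to
 * reject wrapping extents with extent_wraps() first.
 */
static int
extents_overlap(const struct wedge_extent *a, const struct wedge_extent *b)
{
	if (a->we_size == 0 || b->we_size == 0)
		return 0;
	return a->we_offset < b->we_offset + b->we_size &&
	    b->we_offset < a->we_offset + a->we_size;
}
```

Adding a wedge to a disk that already has open wedges would then amount to running the new extent against each existing one and refusing on the first hit.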
> 4. A minphys entry point is added to "struct dkdriver". Eventually,
>    I would like to fully utilize "struct dkdriver" as the interface
>    to a disk from a wedge, rather than using a vnode. Once we are
>    fully transitioned to wedges, I would like to see the traditional
>    entry points to disk drivers go away, with the exception of an
>    entry point for the raw disk, so that partitions may be created
>    on it.
>
> 5. My patch includes modifications to make wedges work with the "wd"
>    driver. I will convert the other disk drivers over time. An
>    outstanding question: What should we do about floppy drives?
I'd say leave them alone. We don't support partitioning them; the letters
are just the formatting density.
> 6. I have modified fsck and mount to use the partition type names
>    that wedges provide. Conveniently, I have defined names that
>    match the fsck_* and mount_* names for the various partition
>    types that indicate file systems.
>
> Known issues:
>
> 1. You can't currently newfs a wedge. This is because newfs
>    requires the old-style DIOCGDLABEL ioctl, which wedges do
>    not support. I am working on a means for exporting the
>    parent disk's geometry through the wedge, which is what
>    newfs wants.
>
> 2. Related to (1), what to do about the block size / frag size
>    entries in "struct partition" (part of "struct disklabel",
>    and thus antiquated and obsolete and not part of wedges)?
I'd say keep them as much as we can. Obviously if the on-disk labeling
won't support the extra info, we can't keep it. But to the extent we can
keep this stuff, I'd like us to. I know that I've saved my butt on
occasion (ok, twice) from having this info in the disklabel. Yes, it's not
a cure-all; if the disk dies or the part map gets scribbled all over, we
lose. But it helps.

I think we need some way to extract the geometry plus some way of
describing the partition itself. For the latter, we talked a bit and I was
thinking of something like:
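
	struct wedgeinfo {
		uint64_t	wi_offset;	/* start of the wedge on the parent */
		uint64_t	wi_size;	/* size of the wedge */
		union {
			uint32_t fsize;		/* file system fragment size */
			uint32_t cdsession;	/* CD-ROM session offset */
		} _wi_u4;
	#define wi_fsize	_wi_u4.fsize
	#define wi_cdsession	_wi_u4.cdsession
		uint8_t		wi_fstype;	/* file system type */
		uint8_t		wi_frag;	/* fragments per block */
		union {
			uint16_t cpg;		/* UFS: cylinders per group */
			uint16_t sgs;		/* LFS: segment shift */
		} _wi_u2;
	#define wi_cpg		_wi_u2.cpg
	#define wi_sgs		_wi_u2.sgs
	};
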
Yes, that's binary-compat with the partition-extra stuff we have in struct
disklabel now.
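One way to sanity-check the binary-compat claim is to compare field offsets relative to the start of the "extra" area in both structures. The sketch below uses local mirror definitions only: the old-style entry is written from memory of struct partition and should be read as an assumption rather than a copy of the header, and the macro and function names are made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local mirror of the proposed struct; not taken from any header. */
struct wedgeinfo {
	uint64_t wi_offset;
	uint64_t wi_size;
	union { uint32_t fsize; uint32_t cdsession; } _wi_u4;
	uint8_t  wi_fstype;
	uint8_t  wi_frag;
	union { uint16_t cpg; uint16_t sgs; } _wi_u2;
};

/* Assumed layout of the per-partition entry in struct disklabel. */
struct old_partition {
	uint32_t p_size;
	uint32_t p_offset;
	union { uint32_t fsize; uint32_t cdsession; } _p_u4;
	uint8_t  p_fstype;
	uint8_t  p_frag;
	union { uint16_t cpg; uint16_t sgs; } _p_u2;
};

/* Offset of a field relative to the start of the "extra" area. */
#define WI_REL(f) \
	(offsetof(struct wedgeinfo, f) - offsetof(struct wedgeinfo, _wi_u4))
#define P_REL(f) \
	(offsetof(struct old_partition, f) - offsetof(struct old_partition, _p_u4))

/* Nonzero if the extra fields sit at the same relative offsets. */
static int
extra_layout_matches(void)
{
	return WI_REL(wi_fstype) == P_REL(p_fstype) &&
	    WI_REL(wi_frag) == P_REL(p_frag) &&
	    WI_REL(_wi_u2) == P_REL(_p_u2) &&
	    sizeof(struct wedgeinfo) - offsetof(struct wedgeinfo, _wi_u4) ==
	    sizeof(struct old_partition) - offsetof(struct old_partition, _p_u4);
}
```

Only the 64-bit offset and size differ from the old entry; everything after them lines up, so the extra 8 bytes can be copied across as a block.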
So back to the newfs question, I think newfs wants an ioctl to read the
disk info and the part info. They could both be loaded into some uber
wrapper struct. Then it wants a function to "write" the part info.
Obviously the size and offset wouldn't actually be writable.
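A rough sketch of that shape, in plain C rather than as a real ioctl: one wrapper carrying both the geometry and the partition info, and a "write" path that refuses any change to offset or size. Every name below is hypothetical, and the geometry struct is trimmed to two fields just to keep the example short.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical, trimmed-down views of what newfs would ask for. */
struct diskgeom {
	uint32_t dg_secsize;	/* bytes per sector */
	uint64_t dg_secperunit;	/* total sectors on the parent disk */
};

struct partinfo {
	uint64_t pi_offset;	/* read-only from user space */
	uint64_t pi_size;	/* read-only from user space */
	uint32_t pi_fsize;
	uint8_t  pi_fstype;
	uint8_t  pi_frag;
	uint16_t pi_cpg;
};

/* The "uber wrapper": both pieces returned by one read request. */
struct wedge_newfs_info {
	struct diskgeom wn_geom;
	struct partinfo wn_part;
};

/*
 * Apply a "write part info" request. Returns 0 on success, or EINVAL
 * if the caller tried to change the offset or size.
 */
static int
partinfo_set(struct partinfo *cur, const struct partinfo *req)
{
	if (req->pi_offset != cur->pi_offset || req->pi_size != cur->pi_size)
		return EINVAL;
	cur->pi_fsize = req->pi_fsize;
	cur->pi_fstype = req->pi_fstype;
	cur->pi_frag = req->pi_frag;
	cur->pi_cpg = req->pi_cpg;
	return 0;
}
```

newfs would read the wrapper once, fill in the file system parameters it picked, and hand the partition half back; the kernel side then decides whether the on-disk format can actually store them.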
I'm glad you're working on this, as I'm actually working on making it so
that we support changing fs type on an Apple PartMap. The "ioctl() to
write struct partition" trick is what I was going to use there, and then
add code so that the kernel could write the change back. I want to move
away from all of the "write disklabel" stuff for the reasons that motivate
wedges above. So this change is all in the right direction. :-)
> I would like to get "wedges" checked into the tree to allow for greater
> collaboration on it. Since it does not interfere with the use of disks
> through the traditional interface, I don't think it's necessary to put
> this on a branch.
Sounds like an excellent idea!
Take care,
Bill