Subject: Re: "sector range" driver (was Re: APPLE_UFS on i386?)
To: Christian Limpach <chris@pin.lu>
From: Bill Studenmund <wrstuden@netbsd.org>
List: current-users
Date: 03/27/2003 17:17:32

On Wed, 26 Mar 2003, Christian Limpach wrote:

> Quoting "Perry E. Metzger" <perry@piermont.com>:
>
> > It would be very nice to have such a driver. I don't suppose you could
> > submit it?
>
> It's been available for a while and I've been using it since.  It probably
> won't work on -current because of changes to the devsw/lkm stuff.  If there's
> really interest to include it, I'd update it to -current and make whatever
> changes are needed/requested.

This will probably serve as the basis for adding wedges to the system. To
do so, it needs a few changes. Well, one main one.

> There's more info in
> http://mail-index.netbsd.org/netbsd-users/2003/02/04/0012.html.
> The driver is the device-mapper part and it's at version 0.03 now.
>
> I wrote the driver to have Linux LVM2 on NetBSD.  The driver allows volume
> management software to be implemented entirely in userspace.  It currently
> supports linear and striped (ccd compatible) mapping and I've added some
> limited raid1 support (it always reads from the first device and writes to
> both; it's missing scheduling of reads across multiple devices, fallback to
> the mirror on error, and passing errors up to userspace).  I've also added
> support for device level snapshots, but it has some unresolved issues and
> I'm not pursuing this at the moment since I don't really need the
> functionality.
>
> Mappings are implemented as source-level and/or LKM-level plugins.  The
> driver is implemented as an LKM for 1.6_STABLE/1.6.1_RC2, tested on i386.  A
> device is configured through a table which defines for ranges of sectors how
> these sectors are mapped to other devices.
>
> Some examples:  A device is created with the dmsetup create command.
> The table:
>     0 1639008 linear /dev/wd0d 5210352
> defines a device equal to this partition in wd0's disklabel:
>      e:   1639008   5210352     4.2BSD   1024  8192    86   # (Cyl. 5169 - 6794)
> By having multiple lines in the table, sectors can be gathered from several
> places.
>
> The table:
>     0 327680 striped 2 32 /dev/wd0m 0 /dev/wd0n 0
> defines a device equal to the ccd with the following config:
>     ccd0 32 none /dev/wd0m /dev/wd0n
> 327680 is the total number of sectors on /dev/wd0[mn].
>
> The table:
>     0 163840 raid1 2 /dev/wd0m 0 /dev/wd1d 2056576
> defines a raid1 on partition wd0m and on sectors 2056576-2220416 on wd1d.
> It's also possible to have more than 1 copy.
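
To make the table format above concrete, here is a minimal Python sketch of
how a line could be parsed and a logical sector mapped to a backing device.
It is not taken from the driver. The names Target, parse_line and map_sector
are made up for illustration, and the striped arithmetic assumes chunks are
handed out round-robin across the listed devices, which is my reading of
"ccd compatible" rather than something checked against the code.

```python
from typing import NamedTuple


class Target(NamedTuple):
    start: int    # first logical sector covered by this table line
    length: int   # number of sectors covered
    kind: str     # mapping type, e.g. "linear" or "striped"
    args: list    # remaining fields, interpreted per mapping type


def parse_line(line: str) -> Target:
    """Split one table line into its common prefix and type-specific args."""
    fields = line.split()
    return Target(int(fields[0]), int(fields[1]), fields[2], fields[3:])


def map_sector(t: Target, sector: int):
    """Return (device, sector on that device) for a logical sector."""
    if not t.start <= sector < t.start + t.length:
        raise ValueError("sector is outside this table line")
    rel = sector - t.start
    if t.kind == "linear":
        # args: <device> <offset>
        return t.args[0], int(t.args[1]) + rel
    if t.kind == "striped":
        # args: <number of devices> <chunk size> then <device> <offset> pairs
        ndev, chunk = int(t.args[0]), int(t.args[1])
        pairs = t.args[2:]
        chunk_no, within = divmod(rel, chunk)
        idx = chunk_no % ndev  # assumed round-robin chunk placement
        dev, offset = pairs[2 * idx], int(pairs[2 * idx + 1])
        return dev, offset + (chunk_no // ndev) * chunk + within
    raise ValueError("mapping type not handled in this sketch: " + t.kind)
```

With the linear example, logical sector 0 lands on sector 5210352 of
/dev/wd0d. With the striped example, logical sector 32 is the first sector
of /dev/wd0n under the round-robin assumption.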

If we want to use it as the kernel side of wedges, we need to do
something other than refer to "/dev/wd0X" for the source of backing
blocks. Referring to anything other than the raw partition means we're
still basing this on partition info, and the point is to move partition
info out into userland. Nor can we refer to the raw partition as things
stand, since the current syntax then requires a raw sector offset. The
problem with an offset is that it's hard (and dangerous) to keep in sync
with the partitioning info.

What would work is using a partition locator string. The exact format is
still up in the air, so feel free to comment. I'm thinking of something like
"NDL 4" to refer to NetBSD disklabel partition 4, "APL 5" for Apple
Partition Map partition 5, "MBR 0 NDL 2" for partition 2 in the NetBSD
disklabel in MBR partition 0, or "MBR 3 MBR 2" for an extended MBR
partition: partition 2 in the MBR in partition 3 of the main MBR.
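
As a rough illustration of the locator idea, the sketch below parses such a
string into a list of (scheme, index) steps, outermost first. Since the
format is not settled, treat everything here as an assumption: the function
name parse_locator is made up, and the only rule enforced is that the string
consists of whitespace-separated scheme/number pairs.

```python
def parse_locator(text):
    """Parse e.g. "MBR 0 NDL 2" into [("MBR", 0), ("NDL", 2)]."""
    fields = text.split()
    if not fields or len(fields) % 2 != 0:
        raise ValueError("expected one or more '<scheme> <index>' pairs")
    path = []
    # Even positions hold scheme tags, odd positions hold partition numbers.
    for scheme, index in zip(fields[0::2], fields[1::2]):
        if not index.isdigit():
            raise ValueError("partition index must be a number: " + index)
        path.append((scheme, int(index)))
    return path
```

Each step names a partition inside the one located by the previous step, so
"MBR 3 MBR 2" reads as MBR partition 3, then partition 2 of the MBR found
inside it.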

The point is to configure based on names/locators for partitions, not
their locations. That way if another OS repartitions, we don't start
scribbling on random space.

Obviously there would need to be a userland library to read different
partitioning schemes and supply a list of found partitions that this tool
would put together into the wedges.

It'd be fine if the kernel interface was just offsets, but the userland
config file should use partition locators. It'd also be fine if the
partition locator support came in stages - just supporting the partition
type(s) on your box would be fine in an initial submission.
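
To sketch how the two layers could fit together: userland resolves a locator
against whatever partitions it found on the disk, and only then emits an
offset-based table line for the kernel. Everything below is hypothetical.
The Partition class, the FOUND dictionary and the function names are
invented, offsets are assumed to be absolute sectors on the raw device, and
the MBR entry uses made-up numbers. The "NDL 4" entry reuses the wd0e values
from Christian's disklabel example.

```python
from dataclasses import dataclass, field


@dataclass
class Partition:
    start: int   # absolute start sector on the raw device (assumption)
    size: int    # size in sectors
    children: dict = field(default_factory=dict)  # nested partitions


# Stand-in for what a userland scanning library might report for wd0.
FOUND = {
    ("NDL", 4): Partition(5210352, 1639008),
    # Made-up values, only to show nesting:
    ("MBR", 0): Partition(63, 4000000, {("NDL", 2): Partition(63, 4000000)}),
}


def resolve(found, locator):
    """Walk a locator such as "MBR 0 NDL 2" through the found partitions."""
    fields = locator.split()
    if not fields or len(fields) % 2 != 0:
        raise ValueError("expected '<scheme> <index>' pairs")
    level, part = found, None
    for scheme, index in zip(fields[0::2], fields[1::2]):
        key = (scheme, int(index))
        if key not in level:
            # Refuse to guess: a vanished partition must not map anywhere.
            raise LookupError("no such partition: %s %s" % (scheme, index))
        part = level[key]
        level = part.children
    return part


def table_line(raw_device, part):
    """Emit an offset-based linear table line for the kernel side."""
    return "0 %d linear %s %d" % (part.size, raw_device, part.start)
```

Resolving "NDL 4" against this sample and passing the result to table_line
with /dev/wd0d reproduces the linear table line quoted earlier. If another
OS repartitions the disk, the lookup either finds the partition at its new
location or fails, instead of reusing a stale offset.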

Thoughts?

Take care,

Bill