tech-kern archive


In-kernel units for block numbers, etc ...



    Date:        Mon, 17 Aug 2015 23:20:01 +0000 (UTC)
    From:        mlelstv%serpens.de@localhost (Michael van Elst)
    Message-ID:  <20150817232001.13113A6558%mollari.NetBSD.org@localhost>

  | The following reply was made to PR bin/50108; it has been noted by GNATS.

The quotes are from a message Michael van Elst made in reply to PR bin/50108
back on 17th August (2015 in case it isn't implied).   The full message
should be available in the PR.

As people who have been following the (meandering) thread related to
"beating a dead horse" on the netbsd-users list will know, I have been
looking at supporting drives with 4K sector sizes properly in NetBSD.

Currently what we have is a mess.   Really, despite this ...

  |  At some point, about when the SCSI subsystem was integrated into
  |  the kernel, the model was changed, the kernel now uses the fixed
  |  DEV_BSIZE=512 coordinates and the driver translates that into
  |  physical blocks.

being kind of close, it is just not true.

Now, that PR, and Michael's message, were mostly on the topic of
kernel/user interactions, and noted that userland is expected to use
sector-size units, not DEV_BSIZE; all is fine most of the time,
except for code that is shared between kernel and userland (shared
WAPBL code was the issue in the PR).

So, not getting all the in-kernel details precisely correct may be
excused.

For what I have been looking at however, that doesn't work, and that
"the driver translates" is a problem.

That's because it simply isn't true that the kernel uses DEV_BSIZE units
(let's call those "blocks" for the purpose of this e-mail, and "sectors"
will be things measured in the relevant native sector size) everywhere.

For stuff related to low level properties of devices (like reading and
writing labels, etc) the kernel uses sectors, and sector numbers, not
blocks and block numbers.   But they all end up going through the same
driver interfaces to actually perform the I/O.

Now at the minute, things are carefully arranged so that in the normal
case (eg: a ffs on a drive) it all just works: the translations happen when
they should and don't happen when they shouldn't.   It is almost magic.
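
To make that "almost magic" concrete, here is a minimal standalone sketch
of the translation a driver ends up doing for ordinary file system I/O:
the block number handed down is in DEV_BSIZE units and has to become a
native sector number before the hardware sees it.   The DEV_BSIZE and
DEV_BSHIFT values match the kernel's; everything else here (the names,
the helper, the example values) is invented for illustration and is not
code from any real driver.

    #include <stdint.h>
    #include <stdio.h>

    #define DEV_BSHIFT  9
    #define DEV_BSIZE   (1 << DEV_BSHIFT)   /* 512 */

    /* Convert a block number counted in DEV_BSIZE units into a sector
     * number for a drive whose sectors are secsize bytes. */
    static int64_t
    blkno_to_sector(int64_t blkno, uint32_t secsize)
    {
        return (int64_t)(((uint64_t)blkno * DEV_BSIZE) / secsize);
    }

    int
    main(void)
    {
        /* On a 512 byte sector drive the numbers are identical ... */
        printf("%lld\n", (long long)blkno_to_sector(64, 512));   /* 64 */
        /* ... on a 4K sector drive, 8 blocks make one sector. */
        printf("%lld\n", (long long)blkno_to_sector(64, 4096));  /* 8 */
        return 0;
    }

When the sector size is 512 the conversion is the identity, which is why
everything has always appeared to just work; the trouble only shows up
once the two units differ.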

Unfortunately, once we add the "stacked" devices (cgd, and ccd for sure,
perhaps lvm and raidframe, I haven't looked at those enough to know)
the model breaks down, and we get incorrect conversions.

Currently at least cgd and (I believe) lvm just pretend that sectors
are blocks.   When sectors really are blocks (ie: when sector size ==
DEV_BSIZE, which has been the case almost always) that just works,
obviously.   When sectors are bigger than blocks it also "works", you
just get less available space: for 4K sector drives, cgd and lvm both
give you 1/8 of the space you should have had on the device.   Sector
sizes smaller than blocks are rare, and becoming rarer; we mostly just
ignore that case (there are fragments of code in the kernel that pretend
to allow it, but most of the code just assumes it cannot happen.)
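
The 1/8 figure is just the ratio of the two unit sizes.   A sketch of the
arithmetic (the device size is made up, and this is not code from cgd or
lvm):

    #include <stdint.h>
    #include <stdio.h>

    #define DEV_BSIZE 512

    int
    main(void)
    {
        uint64_t nsectors = 488378646;   /* an invented 4K sector drive */
        uint32_t secsize  = 4096;

        /* The real capacity of the device ... */
        uint64_t real_bytes  = nsectors * secsize;
        /* ... and what you see if the sector count is re-used,
         * unconverted, as a count of DEV_BSIZE blocks. */
        uint64_t bogus_bytes = nsectors * DEV_BSIZE;

        printf("real: %llu bytes\n", (unsigned long long)real_bytes);
        printf("seen: %llu bytes (1/%u of the real size)\n",
            (unsigned long long)bogus_bytes,
            (unsigned)(secsize / DEV_BSIZE));
        return 0;
    }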

ccd (especially if combining a 4k byte sector device with a 512 byte sector
device) is simply a mess - perhaps almost a candidate for extermination.
(Or maybe it can be resuscitated, who knows right now?)

raidframe I haven't thought about, or investigated, at all.

I currently see two (wildly different) approaches that could be
used to fix the problems...

One is to convert the kernel to use byte offsets absolutely everywhere.
Convert to/from byte offsets when dealing with hardware (like disks that
want an LBA, for some size of B) and with formats that store values in
units other than bytes (like labels on disks, etc).   But internally,
everything would be counted in bytes, always (zero exceptions allowed.)
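
A rough sketch of what that single boundary conversion might look like
under the byte-offset model (the struct and names here are invented for
illustration, not a proposed interface):

    #include <assert.h>
    #include <stdint.h>

    struct hw_xfer {
        uint64_t lba;        /* first native sector */
        uint32_t nsectors;   /* number of native sectors */
    };

    /* The only place a unit other than bytes appears: at the hardware
     * boundary a byte offset and byte count become an LBA and a sector
     * count. */
    static void
    bytes_to_hw(uint64_t byteoff, uint64_t bytecount, uint32_t secsize,
        struct hw_xfer *xp)
    {
        /* Device I/O still has to be sector aligned and sector sized. */
        assert(byteoff % secsize == 0);
        assert(bytecount % secsize == 0);

        xp->lba = byteoff / secsize;
        xp->nsectors = (uint32_t)(bytecount / secsize);
    }

    int
    main(void)
    {
        struct hw_xfer x;

        bytes_to_hw(4096, 8192, 4096, &x);   /* sector 1, 2 sectors */
        return (x.lba == 1 && x.nsectors == 2) ? 0 : 1;
    }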

The other is to carry explicit unit designators along with every blk/sec
number, everywhere (so struct buf, which has b_blkno, would also need a
b_blkunit field added for example - no endorsement implied for that name.)
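
As a sketch only (the struct and helper are invented, and b_blkunit is
just the placeholder name from above), carrying the unit with the number
might look something like this:

    #include <stdint.h>

    /* Not the real struct buf: just the two fields relevant here. */
    struct xbuf {
        int64_t  b_blkno;    /* block/sector number ...              */
        uint32_t b_blkunit;  /* ... and the size, in bytes, of the
                              * unit that b_blkno is counted in.     */
    };

    /* With the unit explicit, re-expressing the number in some other
     * unit is unambiguous, whichever of the two units is larger. */
    static int64_t
    xbuf_blkno_in(const struct xbuf *bp, uint32_t unit)
    {
        return (int64_t)(((uint64_t)bp->b_blkno * bp->b_blkunit) / unit);
    }

    int
    main(void)
    {
        struct xbuf b = { .b_blkno = 3, .b_blkunit = 4096 };

        /* Sector 3 of a 4K device is block 24 in DEV_BSIZE terms. */
        return xbuf_blkno_in(&b, 512) == 24 ? 0 : 1;
    }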

Either of those would remove the ambiguity that we currently face (and
without needing to special case the code to deal with the possibility that
sectors might be bigger, or smaller, than blocks).

The first of those has the advantage that it more closely models the
kernel/user generic interface for most i/o sys calls.  Read/write get
told how many bytes to transfer, not how many blocks, or sectors, even
when transferring to/from a device that requires a fixed number of sectors
in order to work.   Similarly lseek always gets given a byte offset, never
a block or sector number.

It has two disadvantages that I can see at the minute.   One is that it
would require "large int" fields (at least 64 bits) everywhere, which would
infect even small old systems (sun2, vax, ...) which are probably never going
to see a device with anything but 512 byte sectors, nor anything big enough
to need more than 32 bits as a sector number ... and which tend not to have
native 64 bit arithmetic available in hardware (and are already slow.)

Second, it makes some things that are currently constants become variable.
For example, a GPT primary label goes in sector 1.  That's at byte offset
512 when that is the sector size, or byte offset 4096 for 4K drives.
How much of a problem this would be I haven't really investigated yet,
but I suspect it is likely to infect far more code than we'd like to hope.
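
As a concrete instance of a constant turning into a variable, the byte
offset of the GPT primary header would have to be computed from the
sector size, along these lines (a sketch, nothing more):

    #include <stdint.h>
    #include <stdio.h>

    int
    main(void)
    {
        const uint32_t secsizes[] = { 512, 4096 };

        for (size_t i = 0; i < sizeof(secsizes) / sizeof(secsizes[0]); i++) {
            /* The GPT primary header is always sector 1, but the byte
             * offset of sector 1 depends on the sector size. */
            uint64_t off = (uint64_t)1 * secsizes[i];
            printf("secsize %u: GPT primary header at byte offset %llu\n",
                (unsigned)secsizes[i], (unsigned long long)off);
        }
        return 0;
    }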

The second solution requires changes to data structures and function
signatures all over the place; it would be a major shake-up of the
internals of the system.

Currently I have no real preference (a slight leaning towards the 2nd)
so I'm seeking opinions.   Is one of these approaches better than the
other (and if so why) or is there some other way I haven't considered yet?

kre


