Subject: Re: non-512-byte-sector devices vs. UBC
To: Chris G. Demetriou <cgd@netbsd.org>
From: Chuck Silvers <chuq@chuq.com>
List: tech-kern
Date: 06/13/1999 10:29:19
(I'm back from usenix now... that was really fun!)


Chris G. Demetriou writes:
> > (2) make DEV_BSIZE device-specific.  this would require all clients
> >       of the "struct buf" interface to determine the units of b_blkno
> >       before making i/o requests.
> 
> I'm inclined towards this one or a modification of (3).
> 
> 
> > (3) change DEV_BSIZE to 1, and make b_blkno a 64bit field (possibly by way
> > 	of changing "daddr_t" to a 64bit type).
> >     advantages:
> > 	also solves the problem in a natural way, with much less code
> > 	change than (2).  most code should work as-is.
> > 	supports devices larger than 2^40 bytes (though this is also true
> > 	if we make b_blkno 64bits with the existing DEV_BSIZE).
> >     disadvantages:
> > 	though this change to the interface is much less drastic than (2),
> > 	it could still involve needing to "fix" some code which is working
> > 	with the current DEV_BSIZE.
> > 	the lower 9 bits of b_blkno will be wasted for most uses, since
> > 	most devices and filesystems assume they would be 0.
> > 	64bit b_blkno will cause extra overhead for devices that don't
> > 	need it.
> 
> Uh, excuse me, natural in what universe?
> 
> DEV_BSIZE having a value of 512 kinda makes sense, in that that's the
> block size of a lot of devices we use.
> 
> DEV_BSIZE going away and/or having a 'variable' value also makes
> sense, because it, or its replacement code, actually uses the block
> size of the underlying device.
> 
> This proposal isn't nearly so logical.  In what world is the device
> block size 1 byte?
> 
> What you've really said here is that you want to kill DEV_BSIZE and
> replace b_blkno with something named something like b_offset, but you
> don't want to go to the trouble of actually doing the work to do that.
> 
> If you want to actually do that, then _do it_, don't go half way.

yes, you've figured me out, I wanted to switch to byte offsets
and be lazy about it.  :-)

however, after a little more investigation, I find that changing DEV_BSIZE
to 1 would still require changing quite a bit of code anyway, since a bunch
of places use it as the size of an i/o when they want to read one sector.
in general, if we lose the assumption that sectors are DEV_BSIZE bytes for
all devices, then all clients of struct buf have to determine the sector
size to make sure the request is an integral number of sectors in length,
regardless of the units of the offset.  so leaving DEV_BSIZE defined at all
looks like a bad idea.
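
as a rough sketch of the check every client would end up needing (the
helper name and the idea of passing the sector size in directly are made
up for illustration, this is not an existing kernel function):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * hypothetical helper: returns true if an i/o of "bcount" bytes is an
 * integral number of sectors on a device whose sector size is "secsize".
 * a sector size of 0 is treated as invalid.
 */
static bool
io_len_is_sector_multiple(size_t bcount, uint32_t secsize)
{
	if (secsize == 0)
		return false;
	return (bcount % secsize) == 0;
}
```

so a 512-byte read passes on a 512-byte-sector disk but fails on a
device with 2048-byte sectors, whatever units the offset is in.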

I don't have a good feeling for whether it's better to use device-specific
b_blkno units or to replace b_blkno with b_offset; I don't see a strong
argument for either one over the other.  I guess I lean towards the
device-specific b_blkno, since then all the math to compute the sector
number can be done in one place, which could result in faster code.
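
by "in one place" I mean something along these lines. this is only a
sketch: the struct and function names are invented, and it assumes a
power-of-two sector size so that the conversion is a shift.

```c
#include <stdint.h>

/* hypothetical per-device description; these names are not from the tree */
struct blkdev_geom {
	uint32_t secsize;	/* bytes per sector (power of two assumed) */
	uint32_t secshift;	/* log2(secsize) */
};

/* byte offset -> device-specific block number, done once by the client */
static uint64_t
byte_to_blkno(const struct blkdev_geom *g, uint64_t off)
{
	return off >> g->secshift;
}

/* device-specific block number -> byte offset */
static uint64_t
blkno_to_byte(const struct blkdev_geom *g, uint64_t blkno)
{
	return blkno << g->secshift;
}
```

with b_offset instead, the driver would do the equivalent of
byte_to_blkno() itself on every request.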

on a related note, it does seem odd that the current buf interface has
the offset of the i/o (b_blkno) in sectors, but the size of the i/o
(b_bcount) in bytes.  it would be more consistent for both offset and size
to have the same units.  or are there devices that can read partial
sectors, but only at the beginning of a sector?  (I'm not saying that
we need to do anything about this, just observing the inconsistency.)

also, while I was looking at DEV_BSIZE usage I noticed that ffs has an
on-disk superblock field "fs_fsbtodb", which means that an ffs filesystem
has embedded in it the sector size of the underlying device.  yuk.
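
to illustrate why (a sketch with a stand-in struct rather than the real
"struct fs"; I'm assuming the field is used as a shift count in the same
way the fsbtodb() macro uses it, and the function name is made up):

```c
#include <stdint.h>

/* stand-in for the superblock; only the one field matters here */
struct fs_sketch {
	int32_t fs_fsbtodb;	/* shift: fragment number -> device block */
};

/* same shape as fsbtodb(): convert a fragment number to a device block */
static uint64_t
sketch_fsbtodb(const struct fs_sketch *fs, uint64_t frag)
{
	return frag << fs->fs_fsbtodb;
}
```

with 1k fragments on a 512-byte-sector disk the shift is 1, so fragment
10 maps to block 20.  that stored shift is only right for devices with
the same sector size.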


> On a related note, the problem of non-power-of-two block sizes was
> brought up.  Is it really intended that they be supported in the
> kernel?
> 
> I'm inclined to think that representation in terms of block numbers is
> likely to be more efficient than representation in terms of bytes, but
> not too much worse because it should be easy to look up block size (or
> 'size shift'), then calculate bytes easily.  With non-power-of-two
> block sizes, however, no matter what you do, you need to do division
> for byte->block and multiplication for block->byte translation.  On
> some architecture, division is expensive.
> 
> Sure, sure, compared to the time it takes to do an I/O a few dozen
> divisions isn't too much, but it becomes more significant if you're
> talking about a cached block, and in either case it does consume CPU
> that could have been used on something else...


we should be able to arrange it so that the overhead of multiplies and
divides is only incurred when actually accessing such a device, which
I'd say is ok.  I'll argue that filesystems should never support
non-power-of-two devices, so it would only be user processes doing physio.
in the current example of audio CDs, the existing method of issuing
raw scsi commands would still be available, so if the overhead of
having the kernel do the math is deemed too much, it can still be avoided.
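
one way to arrange that, as a sketch only (all names here are invented,
and the sector size is assumed to be nonzero): remember a shift for
power-of-two devices and fall back to divide/multiply for the rest.
the 2352-byte audio CD sector is the non-power-of-two case.

```c
#include <stdint.h>

/* hypothetical per-device sector info */
struct dev_secinfo {
	uint32_t secsize;	/* bytes per sector, must be nonzero */
	int	 secshift;	/* log2(secsize), or -1 if not a power of two */
};

static void
dev_secinfo_init(struct dev_secinfo *si, uint32_t secsize)
{
	si->secsize = secsize;
	si->secshift = -1;
	/* a power of two has exactly one bit set */
	if (secsize != 0 && (secsize & (secsize - 1)) == 0) {
		int shift = 0;
		while ((UINT32_C(1) << shift) != secsize)
			shift++;
		si->secshift = shift;
	}
}

/* byte offset -> block number; only odd-sized devices pay for a divide */
static uint64_t
dev_byte_to_blk(const struct dev_secinfo *si, uint64_t off)
{
	if (si->secshift >= 0)
		return off >> si->secshift;
	return off / si->secsize;
}

/* block number -> byte offset; only odd-sized devices pay for a multiply */
static uint64_t
dev_blk_to_byte(const struct dev_secinfo *si, uint64_t blk)
{
	if (si->secshift >= 0)
		return blk << si->secshift;
	return blk * si->secsize;
}
```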

at any rate, whether or not we support this doesn't really affect the
interface.  both of the remaining interfaces under discussion allow it.

-Chuck