Subject: Re: The demise of DEV_BSIZE
To: Bill Studenmund <wrstuden@nas.nasa.gov>
From: Chuck Silvers <chuq@chuq.com>
List: tech-kern
Date: 10/05/1999 15:14:03
cool.  I had a start on this in the UBC branch, but I'm glad someone else
is doing the rest of it.  I have a few comments:

1.  do we really want to pretend to support non-power-of-two devices at all
    in the device interface?  we kinda went thru this when this whole subject
    came up before, and I thought the consensus was that it wasn't worthwhile.

2.  do we really need a (*d_bsize)() ?  I recall that there was already a
    device ioctl that returns this info... or maybe I was just thinking
    that that's how I'd do it eventually.

3.  I'm not sure what "swap blocks" are, but I would guess they should
    be in pagesize units, since that's how swap space is managed.

4.  I'd think the goal should be to get rid of DEF_BSIZE eventually too.
    purely software devices like md can define their own constants.

5.  perhaps the sector size info in the on-disk disklabel should be
    ignored and replaced with the info from the device itself?
    there are probably further disklabel implications to all this.

-Chuck


On Tue, Oct 05, 1999 at 02:17:29PM -0700, Bill Studenmund wrote:
> As part of a project here at NAS to test how *BSD systems deal with lots
> of disk, I've had to get NetBSD working with non-512-byte sector disks.
> 
> To do this, I've worked up patches based on Koji Imada's third proposal
> (PR 3972) and with comments from this list the last time this topic was
> brought up. I want to thank Leo for giving me near-current patches for
> Koji's 3rd proposal.
> 
> I'll repeat what I understand of his proposal here (so y'all can
> understand me even if I misunderstood the proposal :-)  :
> 
> Koji's third proposal: the block numbers in struct buf will be in units of
> the natural block size of the media. So on a 0.5 K sector device, they are
> in 512-byte blocks. On a 1 K device, they are in 1 K blocks. All routines
> which need to worry about block size will just deal with whatever size the
> media possesses. Also (and this is the difference from his 1st proposal),
> filesystems should be able to deal with the filesystem being on a
> different block size device than the one on which it was made. So say I
> have a filesystem made on a 512 byte device, I can dd it to a 1 K sector
> device, and it will just work.
> 
> I also wanted to support media with a sector size which isn't a power of
> two. The i/o system should support it, but filesystems don't necessarily
> have to support non-power-of-2 sectors.
> 
> What I've done: block numbers in struct buf are now in blocks on the media
> - the "natural" media size. ffs has been adjusted so that it will work as
> long as there's only one filesystem block (fragment, actually) per disk
> block. So I can take an 8k/1k ffs from a 512-byte disk to a 1 K byte disk,
> but not to a 2 K byte disk. Supporting more than one data block (ffs frag)
> per disk block would be hard. I've not touched msdosfs or cd9660fs with
> respect to this, so the diffs are whatever Koji & Leo have done. :-)
> 
> I've also changed DEV_BSIZE & DEV_BSHIFT to DEF_BSIZE & DEF_BSHIFT.
> Unfortunately I can't just delete them yet... :-(
> 
> The btodb and dbtob macros have changed. They now take a shift and size
> parameter. They are:
> 
> #define dbtob(x, sh, bks)       ((sh) ? ((x) << (sh)) : ((x) * (bks)))
> #define btodb(x, sh, bks)       ((sh) ? ((x) >> (sh)) : ((x) / (bks)))
> 
> x is the value to be shifted, sh is the device's shift value, and bks
> is the block size in bytes. For a power of 2 block size, sh is the log
> base 2 of the block size. So for 512-byte blocks, sh is 9. For 1 K
> sectors, it's 10, etc. So if the device's block size is a power of 2 (most
> of them), these macros keep shifting. We only multiply and divide if the
> block size isn't a power of 2. This feature is important as dividing is
> always slow, and a number of our architectures have to use a math
> subroutine for division, which is even slower.
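
A minimal sketch of how the two macros above behave, for illustration only. The macro bodies are copied from the message; the wrapper functions blocks_to_bytes() and bytes_to_blocks() are hypothetical names, and treating sh == 0 as "not a power of two" is an assumption read off the macro text rather than something the message states outright.

```c
#include <stdint.h>

/* As quoted in the message: shift when sh is non-zero, else multiply/divide. */
#define dbtob(x, sh, bks)       ((sh) ? ((x) << (sh)) : ((x) * (bks)))
#define btodb(x, sh, bks)       ((sh) ? ((x) >> (sh)) : ((x) / (bks)))

/*
 * Hypothetical wrappers (not part of the patch) so the macros can be
 * exercised with a fixed 64-bit type.
 *
 * Assumption: sh == 0 marks a block size that is not a power of two.
 * Note that sh == -1 ("device not configured") must never reach the
 * macros, since shifting by a negative count is undefined in C.
 */
static int64_t
blocks_to_bytes(int64_t blkno, int sh, int bks)
{
	return dbtob(blkno, sh, bks);
}

static int64_t
bytes_to_blocks(int64_t nbytes, int sh, int bks)
{
	return btodb(nbytes, sh, bks);
}
```

With 512-byte blocks (sh = 9), block 3 starts at byte 1536 and only a shift is used; with a 520-byte block (sh = 0) the same call falls back to a multiply and gives 1560.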
> 
> Both character and block devices have gained a new function call, d_bsize:
> 
> void    (*d_bsize)      __P((dev_t dev, int * bshift, int * bsize));
> 
> which fills in the bshift and bsize values for a device. bshift == -1
> indicates that the device isn't configured.
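
As a rough illustration of what a driver's d_bsize routine might compute, here is a sketch that derives the shift from a sector size. example_fill_bsize() and its "configured" argument are hypothetical and stand in for whatever per-device state a real driver would look up from the dev_t. Returning bshift == 0 for a size that isn't a power of two, and bsize == 0 for an unconfigured device, are assumptions; the message only specifies bshift == -1 for the unconfigured case.

```c
/*
 * Hypothetical helper, not the patch's code: fill in bshift and bsize
 * for a device with the given sector size.
 */
static void
example_fill_bsize(int configured, int secsize, int *bshift, int *bsize)
{
	int sh;

	if (!configured || secsize <= 0) {
		*bshift = -1;		/* per the message: not configured */
		*bsize = 0;		/* assumption: no meaningful size */
		return;
	}

	*bsize = secsize;

	/* Not a power of two: assume shift 0 so callers multiply/divide. */
	if ((secsize & (secsize - 1)) != 0) {
		*bshift = 0;
		return;
	}

	/* Power of two: the shift is log2(secsize). */
	for (sh = 0; (1 << sh) < secsize; sh++)
		continue;
	*bshift = sh;
}
```

A 1-byte "sector" would also come out with shift 0 here, which is harmless because multiplying or dividing by 1 gives the same answer as not shifting.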
> 
> struct specinfo has gained two new fields, si_bshift and si_bsize. They
> cache the block size info for the relevant device. They are initialized in
> checkalias when the new struct specinfo is being generated for the device
> node.
> 
> struct mount has also gained shift & size fields (mnt_bshift & mnt_bsize),
> which reflect the values for the underlying device. The mount routines
> will now do a validity check on the device to make sure the filesystem is
> happy with the block size.
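
The mount-time check described above could look something like the following for ffs, given the "no more than one fragment per disk block" restriction mentioned earlier. This is only a sketch: example_ffs_bsize_ok() is a hypothetical name, and rejecting non-power-of-two devices outright is an assumption based on the remark that filesystems don't have to support them.

```c
/*
 * Hypothetical check, not the patch's code.  fsize is the ffs fragment
 * size in bytes; dev_bshift and dev_bsize are the device values as they
 * would be cached in mnt_bshift and mnt_bsize.
 * Returns non-zero if the filesystem can live on this device.
 */
static int
example_ffs_bsize_ok(int fsize, int dev_bshift, int dev_bsize)
{
	/* Unconfigured (-1) or, by assumption, non-power-of-two (0). */
	if (dev_bshift <= 0 || dev_bsize <= 0)
		return 0;

	/* A disk block may not hold more than one fragment. */
	if (fsize < dev_bsize)
		return 0;

	/* Each fragment must cover a whole number of disk blocks. */
	return (fsize % dev_bsize) == 0;
}
```

For the 8k/1k filesystem in the example, this accepts 512-byte and 1 K devices and rejects a 2 K device, matching the behaviour described in the message.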
> 
> physio has grown two additional parameters, for the block shift and block
> size values. The readdisklabel and writedisklabel routines have also
> gained shift and size values.
> 
> I have modified the sd, cd, wd, and fd drivers to support these changes.
> For the moment, wd is using WD_DEF_BSIZE as I wasn't sure what to do with
> it at the time I made the change. The md driver uses DEF_BSIZE. The fd
> driver's support of the partition encoding the density has been extended
> so that it (on i386) can also encode the sector size. With changes to the
> format table, we should be able to support 256 byte or 1024 byte floppies
> (do they exist?).
> 
> Open issues:
> 
> We can't totally get rid of DEF_BSIZE. In addition to a few cases where we
> really need a DEF_BSIZE (md and memory disks come to mind - there's no
> underlying block size from which to determine values), there are a number
> of other uses layered on top of it. For instance, UFS keeps track of
> "blocks" allocated to a file in units of DEV_BSIZE. I've changed this to
> UFS_BSIZE & UFS_BSHIFT. ufs quotas are in the same unit.
> 
> lfs is sprinkled with DEV_BSIZE. I changed them to DEF_BSIZE for now, but
> this needs fixing. Does struct lfs reflect the on-disk "superblock"? The
> problem I ran into is that it doesn't have fields for disk size (that I
> saw), and since it lacks a pointer to struct mount (which has disk block
> size info), it's hard for all the routines which are passed a struct lfs *
> to get the disk block size right.
> 
> Swap "blocks" are in DEF_BSIZE units. Does that need to change?
> 
> vnd, raidframe, and ccd haven't been updated to reflect these changes. I
> think that both raidframe and ccd should only aggregate like-sized devices.
> vnd obviously needs to be able to change block sizes.
> 
> So far only i386 has been fully changed. I've changed the disklabel entry
> points for other ports, but I'm not sure if I got all the calls to
> auxiliary disklabel routines.
> 
> Other disk drivers need work, like rd, rz, xy, & xd. Are there others?
> 
> Should tape drives do anything with block size? I've done nothing as I'm
> not exactly sure what we should do, nor how to do it (say in the face of
> variable block size tapes).
> 
> disklabel writing needs work in that we shouldn't accept a disklabel whose
> block size we know doesn't match the device's. i.e. for sd & cd drives, we
> can query the device to see what its block size is. We shouldn't let you set
> a disklabel with a different block size. But on devices where we can't
> query the block size (I think xy, xd, rd, and non-ata wd), we need to be
> able to set the block size in the disklabel as it is the authority on the
> block size. :-) Also, if the block size of a drive changes (either we
> write a new disk label or we note a probeable device reports different
> sector sizes), we need to update existing device nodes. Should we vgone
> them, or just update the size fields in their struct specinfo? I think
> vgone.
> 
> My current thought is to make these diffs (which I'm still assembling)
> into a branch. We should be able to merge them in fairly soon. :-)
> 
> I have a system with both 512 and 2048 byte sector disks in it, and I've
> simultaneously used filesystems on devices of both sizes. :-)
> 
> Thoughts? I think I covered everything I've done.
> 
> Take care,
> 
> Bill