Subject: The demise of DEV_BSIZE
To: None <tech-kern@netbsd.org>
From: Bill Studenmund <wrstuden@nas.nasa.gov>
List: tech-kern
Date: 10/05/1999 14:17:29
As part of a project here at NAS to test how *BSD systems deal with lots
of disk, I've had to get NetBSD working with non-512-byte sector disks.

To do this, I've worked up patches based on Koji Imada's third proposal
(PR 3972) and with comments from this list the last time this topic was
brought up. I want to thank Leo for giving me near-current patches for
Koji's 3rd proposal.

I'll repeat what I understand of his proposal here (so y'all can
understand me even if I misunderstood the proposal :-)  :

Koji's third proposal: the block numbers in struct buf will be in units of
the natural block size of the media. So on a 0.5 K sector device, they are
in 512-byte blocks. On a 1 K device, they are in 1 K blocks. All routines
which need to worry about block size will just deal with whatever size the
media possesses. Also (and this is the difference from his 1st proposal),
filesystems should be able to deal with the filesystem being on a
different block size device than the one on which it was made. So say I
have a filesystem made on a 512 byte device, I can dd it to a 1 K sector
device, and it will just work.

I also wanted to support media with a sector size which isn't a power of
two. The i/o system should support it, but filesystems don't necessarily
have to support non-power-of-2 sectors.

What I've done: block numbers in struct buf are now in blocks on the media
- the "natural" media size. ffs has been adjusted so that it will work as
long as there's only one filesystem block (fragment, actually) per disk
block. So I can take an 8k/1k ffs from a 512-byte disk to a 1 K byte disk,
but not to a 2 K byte disk. Supporting more than one data block (ffs frag)
per disk block would be hard. I've not touched msdosfs or cd9660fs with
respect to this, so the diffs are whatever Koji & Leo have done. :-)

I've also changed DEV_BSIZE & DEV_BSHIFT to DEF_BSIZE & DEF_BSHIFT.
Unfortunately I can't just delete them yet... :-(

The btodb and dbtob macros have changed. They now take a shift and size
parameter. They are:

#define dbtob(x, sh, bks)       ((sh) ? ((x) << (sh)) : ((x) * (bks)))
#define btodb(x, sh, bks)       ((sh) ? ((x) >> (sh)) : ((x) / (bks)))

x is the value to be shifted, sh is the device's shift value, and bks
is the block size in bytes. For a power of 2 block size, sh is the log
base 2 of the block size. So for 512-byte blocks, sh is 9. For 1 K
sectors, it's 10, etc. So if the device's block size is a power of 2 (most
of them), these macros keep shifting. We only multiply and divide if the
block size isn't a power of 2. This feature is important as dividing is
always slow, and a number of our architectures have to use a math
subroutine for division, which is even slower.
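
To illustrate how the two paths behave, here is a minimal sketch using the
macros exactly as above. The helper names blk_to_bytes and bytes_to_blk are
made up for this example and are not part of the patches; the 2336-byte
size is just an arbitrary non-power-of-2 value. Note that a shift of 0 is
what selects the multiply/divide path, and that a shift of -1 (unconfigured
device) must never reach the macros.

```c
#include <assert.h>
#include <stdint.h>

/* The macros as given in the message. */
#define dbtob(x, sh, bks)       ((sh) ? ((x) << (sh)) : ((x) * (bks)))
#define btodb(x, sh, bks)       ((sh) ? ((x) >> (sh)) : ((x) / (bks)))

/*
 * Hypothetical helper (not in the patches): convert a device block
 * number to a byte offset. bshift == 0 means "not a power of 2, use
 * bsize"; bshift == -1 means "unconfigured" and is rejected here.
 */
static int64_t
blk_to_bytes(int64_t blkno, int bshift, int bsize)
{
	assert(bshift >= 0 && bsize > 0);
	return dbtob(blkno, bshift, bsize);
}

/* Hypothetical helper: convert a byte count to device blocks. */
static int64_t
bytes_to_blk(int64_t nbytes, int bshift, int bsize)
{
	assert(bshift >= 0 && bsize > 0);
	return btodb(nbytes, bshift, bsize);
}
```

So blk_to_bytes(4, 9, 512) shifts and gives 2048, while
blk_to_bytes(3, 0, 2336) multiplies and gives 7008.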

Both character and block devices have gained a new function call, d_bsize:

void    (*d_bsize)      __P((dev_t dev, int * bshift, int * bsize));

which fills in the bshift and bsize values for a device. bshift == -1
indicates that the device isn't configured.
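
As a rough sketch of what a driver's d_bsize routine might look like (this
is not code from the patches): the xx driver, struct xx_softc, the
xx_units table, and the use of the dev number directly as a unit index are
all invented for illustration, and dev_t is replaced by a plain int
stand-in so the example compiles outside the kernel. Setting bsize to 0
for an unconfigured unit is also my assumption; only bshift == -1 is
specified above.

```c
#include <assert.h>
#include <stddef.h>

typedef int sketch_dev_t;	/* stand-in for the kernel's dev_t */

/* Hypothetical per-unit driver state. */
struct xx_softc {
	int sc_configured;	/* nonzero once the unit has attached */
	int sc_secsize;		/* sector size in bytes */
};

/* Hypothetical unit table: unit 0 has 2048-byte sectors, unit 1 is absent. */
static struct xx_softc xx_units[2] = { { 1, 2048 }, { 0, 0 } };

/* Return log2(size) for a power of 2, or 0 to request multiply/divide. */
static int
xx_shift_for(int size)
{
	int sh = 0;

	if (size <= 0 || (size & (size - 1)) != 0)
		return 0;
	while ((1 << sh) < size)
		sh++;
	return sh;
}

/* Fill in bshift and bsize for dev; bshift == -1 means not configured. */
static void
xxbsize(sketch_dev_t dev, int *bshift, int *bsize)
{
	assert(bshift != NULL && bsize != NULL);

	if (dev < 0 || dev >= 2 || !xx_units[dev].sc_configured) {
		*bshift = -1;
		*bsize = 0;	/* assumption: value unspecified in the message */
		return;
	}
	*bsize = xx_units[dev].sc_secsize;
	*bshift = xx_shift_for(*bsize);
}
```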

struct specinfo has gained two new fields, si_bshift and si_bsize. They
cache the block size info for the relevant device. They are initialized in
checkalias when the new struct specinfo is being generated for the device
node.

struct mount has also gained shift & size fields (mnt_bshift & mnt_bsize),
which reflect the values for the underlying device. The mount routines
will now do a validity check on the device to make sure the filesystem is
happy with the block size.
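
For ffs, the validity check could look something like the sketch below.
This is only my illustration of the "no more than one fragment per disk
block" rule described earlier, not the code in the patches: the function
name ffs_bsize_ok is made up, the return convention (0 or -1 rather than
an errno) is simplified, and the requirement that the fragment size be an
exact multiple of the device block size is an assumption on top of what
the message states.

```c
#include <assert.h>

/*
 * Hypothetical check: can a filesystem with fragment size fsize live
 * on a device with block size dev_bsize? Returns 0 if so, -1 if not.
 */
static int
ffs_bsize_ok(int fsize, int dev_bsize)
{
	assert(fsize > 0);

	if (dev_bsize <= 0)
		return -1;	/* device not configured */
	if (fsize < dev_bsize)
		return -1;	/* more than one fragment per disk block */
	if (fsize % dev_bsize != 0)
		return -1;	/* assumption: fragments must align to blocks */
	return 0;
}
```

With this, the 8k/1k filesystem from the example above passes on 512-byte
and 1 K devices (ffs_bsize_ok(1024, 512) and ffs_bsize_ok(1024, 1024)
return 0) and is refused on a 2 K device (ffs_bsize_ok(1024, 2048)
returns -1).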

physio has grown two additional parameters, for the block shift and block
size values. The readdisklabel and writedisklabel routines have also
gained shift and size values.

I have modified the sd, cd, wd, and fd drivers to support these changes.
For the moment, wd is using WD_DEF_BSIZE as I wasn't sure what to do with
it at the time I made the change. The md driver uses DEF_BSIZE. The fd
driver's support of the partition encoding the density has been extended
so that it (on i386) can also encode the sector size. With changes to the
format table, we should be able to support 256 byte or 1024 byte floppies
(do they exist?).

Open issues:

We can't totally get rid of DEF_BSIZE. In addition to a few cases where we
really need a DEF_BSIZE (md and memory disks come to mind - there's no
underlying block size from which to determine values), there are a number
of other uses layered on top of it. For instance, UFS keeps track of
"blocks" allocated to a file in units of DEV_BSIZE. I've changed this to
UFS_BSIZE & UFS_BSHIFT. ufs quotas are in the same unit.

lfs is sprinkled with DEV_BSIZE. I changed them to DEF_BSIZE for now, but
this needs fixing. Does struct lfs reflect the on-disk "superblock"? The
problem I ran into is that it doesn't have fields for disk size (that I
saw), and since it lacks a pointer to struct mount (which has disk block
size info), it's hard for all the routines which are passed a struct lfs *
to get the disk block size right.

Swap "blocks" are in DEF_BSIZE units. Does that need to change?

vnd, raidframe, and ccd haven't been updated to reflect these changes. I
think that both raidframe and ccd should only aggregate like-sized devices.
vnd obviously needs to be able to change block sizes.

So far only i386 has been fully changed. I've changed the disklabel entry
points for other ports, but I'm not sure if I got all the calls to
auxiliary disklabel routines.

Other disk drivers need work, like rd, rz, xy, & xd. Are there others?

Should tape drives do anything with block size? I've done nothing as I'm
not exactly sure what we should do, nor how to do it (say in the face of
variable block size tapes).

disklabel writing needs work in that we shouldn't accept a disklabel which
we know is not the device's block size. i.e. for sd & cd drives, we can
query the device to see what its block size is. We shouldn't let you set
a disklabel with a different block size. But on devices where we can't
query the block size (I think xy, xd, rd, and non-ata wd), we need to be
able to set the block size in the disklabel as it is the authority on the
block size. :-) Also, if the block size of a drive changes (either we
write a new disk label or we note a probable device reports different
sector sizes), we need to update existing device nodes. Should we vgone
them, or just update the size fields in their struct specinfo? I think
vgone.

My current thought is to make these diffs (which I'm still assembling)
into a branch. We should be able to merge them in fairly soon. :-)

I have a system with both 512 and 2048 byte sector disks in it, and I've
simultaneously used filesystems on devices of both sizes. :-)

Thoughts? I think I covered everything I've done.

Take care,

Bill