Subject: Re: Supporting sector size != DEV_BSIZE
To: Darrin B. Jewell <jewell@mit.edu>
From: Bill Studenmund <wrstuden@netbsd.org>
List: tech-kern
Date: 06/24/2002 21:21:15
On Mon, 24 Jun 2002, Darrin B. Jewell wrote:

> It sounds like you have uncovered the same issues I noticed.  My
> philosophy about the appropriate route to follow centers around two
> points:
>
>   1. The compiled in value for DEV_BSIZE should always be 512
>   2. existing media precedent should be followed to decide where to
>      change current uses of the DEV_BSIZE constant.
>
> I probably should have made this clearer in my original mail.  My
> decision for 1. is that this value is not retrieved from persistent
> media and so it should not be changed from its current value.  My
> decision for 2. is to avoid introducing arbitrary incompatibilities or
> accidentally setting new precedent.

I disagree with point 1. We should be free to change DEV_BSIZE as we wish
(as long as it's 2**n).  That's the only way to tell if we're truly
independent of it, or if we have lurking dependencies. The one caveat is
that it won't work to set DEV_BSIZE to be bigger than a disk's media size.
i.e. we can't (easily)  use 512-byte-sector devices on a 1k-DEV_BSIZE
kernel.

> Bill Studenmund <wrstuden@netbsd.org> writes:
> | > If I recall from my investigation there were at least
> | > the following potentially independent sources of block size:
> | >   . units based on a 512 byte DEV_BSIZE
> | >   . units based on the ffs superblock (see FFS_DEV_BSIZE below)
> |
> | Note: those are file system blocks aka frags.
>
> I would like to carefully assert that my definition of FFS_DEV_BSIZE
> is explicitly not the file system fragment size.  Under my definition,
> the file system fragment size in bytes is determined by fs->fs_fsize
> Even our current newfs sources set the default value for the fragment
> size to 1024.

I took units based on the superblock to mean fs->fs_fsize units.

> I also _always_ define the kernel constant DEV_BSIZE to be 512 and
> _never_ use a different value for it.  By treating it as a fundamental
> constant that never changes and is never retrieved from persistent media,
> it becomes an independent unit.
>
> | >   . units based on the disklabel d_secsize
> | >      ( this should always match the hardware device)
> |
> | Note: the latter isn't necessarily true. If you take a disk image & move
> | it to another system, it may change. Folks wish to continue using the
> | disklabel number.
>
> This is why I mentioned it.  I am not as adamant about this,
> but I was thinking that the in core value for this field
> should always match the hardware sector size.  Currently,
> the device strategy routines use d_secsize to interpret
> bp->b_blkno.   If d_secsize does not match the hardware sector
> size, then the device strategy routines will need to be
> modified to do the appropriate conversion.

I think the modification has already been done. Looking at sd.c for
instance, it happily translates from block numbers in DEV_BSIZE units into
local sectors. It assumes that the sector size == the disklabel size, but
that's a reasonable assumption.

I'm not sure about your use of "interpret." Yes, d_secsize is used when
looking at b_blkno. But b_blkno is kept in units of DEV_BSIZE.

> | > At the time, I found the following definitions useful:
> | >
> | >   #define FFS_DEV_BSHIFT(fs) ((fs)->fs_fshift-(fs)->fs_fsbtodb)
> |
> | That should be a constant in the ufs mount structure (the in-kernel
> | thing). We don't need to subtract those constants every time; they aren't
> | going to change.
>
> That would be acceptable, although the optimization you suggest is
> completely in the noise and adds considerable unnecessary complexity.
> There are already lots of cases in fs.h that do this kind of
> extra math repeatedly at run time.
>
> | >   #define ffs_btodb(fs, b)   ((b) >> FFS_DEV_BSHIFT(fs))
> | >   #define ffs_dbtob(fs, db)  ((db) << FFS_DEV_BSHIFT(fs))
> | >   #define FFS_DEV_BSIZE(fs)  ffs_dbtob(fs,1)
> | >
> | > I remember facing a couple of decisions about what units
> | > quotas and free block counts were kept in.  Can you brief me
> | > on decisions you made regarding these counters when authoring
> | > your patches?  Do you have a rational for your choice?
> |
> | He chose the design philosophy I (ehm strongly) suggested. :-)
>
> Careful, we need to keep compatibility with other vendors who
> have already made this choice, for better or worse.  I have to

What about *ourselves*? I think compatability with ourselves is the most
important feature we can have. :-)

Looking at NetBSD 1.5, we have already defined a format for what NetBSD's
ffs does on non-512 byte media. Mainly as the only way to make it work
there is to have set DEV_BSIZE to match the disk. :-) That set a
zero-modification design spec, which is why I sugested it.

> go back and reaffirm what choice Apple/NeXT did here.  Did sun,
> dec or anyone else also set precedent for this case?  At the
> end of this email, I include the partial dumpfs output for the
> NeXTstep 3.3 OS distribution CD.  You can see some of the
> choices they made by examining the superblock values.
> I can provide the rest of the dumpfs output if someone wants
> to look at it.  This was generated by the unmodified dumpfs in
> currently in our source tree.

I think they all did different things. I know HP-UX has DEV_BSIZE=1024,
and Convex's BSD OS (forgot the name) had DEV_BSIZE=2048.

I think what would be more interesting is to see what Apple/Next did on
the old 1k media.

Also interesting would be to see what some inodes look like.

> | NetBSD 1.5 only supported file systems on disks where the physical sector
> | size == DEV_BSIZE. So anything that uses DEV_BSIZE and is stored on-disk
> | should really use the fs-declared sector size (FFS_DEV_BSHIFT() above,
> | though Trevin used a different name).  When formatting a file system,
> | "FFS_DEV_BSHIFT()" will equal the sector size of the media. That fs can
> | later be moved around, because the superblock has enough info to recreate
> | "FFS_DEV_BSHIFT()".
>
> I agree that normally when creating a filesystem FFS_DEV_BSHIFT
> will match the will equal the sector size of the media, but that
> it may mismatch the media if it has been moved around.
>
> I think most current uses of DEV_BSIZE need to be examined
> to determine whether they should use FFS_DEV_BSIZE, d_secsize,
> or a DEV_BSIZE constant of 512.

I think we can simplify things. Formatting an ffs would care about
d_secsize, but after that, all we need to worry about is what you call
FFS_DEV_BSIZE and DEV_BSIZE, as long as the numbers are reasonable (a
fragment size or FFS_DEV_BSIZE less than DEV_BSIZE won't work I don't
think).

> As I mentioned, here is the partial
> dumpfs output from a nextstep 3.3 operating system distribution CD:

Nothing in it looked bad or conflicting with what Trevin was doing.