Subject: Re: Supporting sector size != DEV_BSIZE
To: Bill Studenmund <wrstuden@netbsd.org>
From: Darrin B. Jewell <jewell@mit.edu>
List: tech-kern
Date: 06/24/2002 23:43:59
It sounds like you have uncovered the same issues I noticed.  My
philosophy about the appropriate route to follow centers around two
points:

  1. The compiled-in value for DEV_BSIZE should always be 512.
  2. Existing media precedent should be followed to decide where to
     change current uses of the DEV_BSIZE constant.

I probably should have made this clearer in my original mail.  My
reasoning for 1. is that this value is not retrieved from persistent
media, so it should not be changed from its current value.  My
reasoning for 2. is to avoid introducing arbitrary incompatibilities
or accidentally setting new precedent.

Bill Studenmund <wrstuden@netbsd.org> writes:
| > If I recall from my investigation there were at least
| > the following potentially independent sources of block size:
| >   . units based on a 512 byte DEV_BSIZE
| >   . units based on the ffs superblock (see FFS_DEV_BSIZE below)
| 
| Note: those are file system blocks aka frags.

I would like to carefully assert that my definition of FFS_DEV_BSIZE
is explicitly not the file system fragment size.  Under my definition,
the file system fragment size in bytes is determined by fs->fs_fsize.
Even our current newfs sources set the default value for the fragment
size to 1024.

I also _always_ define the kernel constant DEV_BSIZE to be 512 and
_never_ use a different value for it.  By treating it as a fundamental
constant that never changes and is never retrieved from persistent media,
it becomes an independent unit.

| >   . units based on the disklabel d_secsize
| >      ( this should always match the hardware device)
| 
| Note: the latter isn't necessarily true. If you take a disk image & move
| it to another system, it may change. Folks wish to continue using the
| disklabel number.

This is why I mentioned it.  I am not as adamant about this,
but I was thinking that the in-core value for this field
should always match the hardware sector size.  Currently,
the device strategy routines use d_secsize to interpret
bp->b_blkno.   If d_secsize does not match the hardware sector
size, then the device strategy routines will need to be
modified to do the appropriate conversion.
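
To make that conversion concrete, here is a minimal sketch of what a
modified strategy routine would have to compute.  The helper name
label_to_hw_blkno and its parameters are hypothetical, not existing
kernel interfaces, and it assumes the product of block number and
sector size fits in 64 bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not an existing kernel interface): convert a
 * block number counted in disklabel sectors (d_secsize bytes each)
 * into a block number counted in hardware sectors.
 *
 * Returns 0 on success and stores the result in *hw_blkno.  Returns -1
 * if either size is zero or if the start of the request does not fall
 * on a hardware sector boundary.
 */
static int
label_to_hw_blkno(uint64_t blkno, uint32_t d_secsize, uint32_t hw_secsize,
    uint64_t *hw_blkno)
{
	uint64_t byte_off;

	assert(hw_blkno != NULL);

	if (d_secsize == 0 || hw_secsize == 0)
		return -1;

	/* Byte offset of the request from the start of the device. */
	byte_off = blkno * d_secsize;

	/* A request that starts mid-sector cannot be issued as is. */
	if (byte_off % hw_secsize != 0)
		return -1;

	*hw_blkno = byte_off / hw_secsize;
	return 0;
}
```

For example, label block 8 with a 512 byte d_secsize lands on hardware
block 2 of a 2048 byte sector device, while label block 3 is rejected
because it starts in the middle of a hardware sector.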

| > At the time, I found the following definitions useful:
| >
| >   #define FFS_DEV_BSHIFT(fs) ((fs)->fs_fshift-(fs)->fs_fsbtodb)
| 
| That should be a constant in the ufs mount structure (the in-kernel
| thing). We don't need to subtract those constants every time; they aren't
| going to change.

That would be acceptable, although the optimization you suggest is
completely in the noise and adds considerable unnecessary complexity.
There are already lots of cases in fs.h that do this kind of
extra math repeatedly at run time.
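
For comparison, caching the shift in the in-kernel mount structure
might look roughly like the sketch below.  The structure, the field
um_dev_bshift, the init function, and the UM_BTODB macro are all
invented for illustration and do not exist in the tree; fs_sketch is a
cut-down stand-in for the real superblock.

```c
#include <assert.h>
#include <stdint.h>

/* Cut-down stand-in for the superblock: only the two fields used here. */
struct fs_sketch {
	int32_t fs_fshift;	/* log2 of the fragment size in bytes */
	int32_t fs_fsbtodb;	/* shift from fragments to device blocks */
};

/* Hypothetical in-core mount record; um_dev_bshift is an invented field. */
struct ufsmount_sketch {
	const struct fs_sketch *um_fs;
	int um_dev_bshift;	/* cached fs_fshift - fs_fsbtodb */
};

/* Compute the shift once, at mount time, instead of on every use. */
static void
ufsmount_sketch_init(struct ufsmount_sketch *ump, const struct fs_sketch *fs)
{
	assert(fs->fs_fshift >= fs->fs_fsbtodb);
	ump->um_fs = fs;
	ump->um_dev_bshift = fs->fs_fshift - fs->fs_fsbtodb;
}

/* Bytes to device blocks using the cached shift. */
#define UM_BTODB(ump, b)	((b) >> (ump)->um_dev_bshift)
```

The saving is one subtraction per conversion, paid for with an extra
field that has to be initialized on every mount path.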

| >   #define ffs_btodb(fs, b)   ((b) >> FFS_DEV_BSHIFT(fs))
| >   #define ffs_dbtob(fs, db)  ((db) << FFS_DEV_BSHIFT(fs))
| >   #define FFS_DEV_BSIZE(fs)  ffs_dbtob(fs,1)
| >
| > I remember facing a couple of decisions about what units
| > quotas and free block counts were kept in.  Can you brief me
| > on decisions you made regarding these counters when authoring
| > your patches?  Do you have a rationale for your choice?
| 
| He chose the design philosophy I (ehm strongly) suggested. :-)

Careful, we need to keep compatibility with other vendors who
have already made this choice, for better or worse.  I have to
go back and confirm what choice Apple/NeXT made here.  Did Sun,
DEC, or anyone else also set precedent for this case?  At the
end of this email, I include the partial dumpfs output for the
NeXTstep 3.3 OS distribution CD.  You can see some of the
choices they made by examining the superblock values.
I can provide the rest of the dumpfs output if someone wants
to look at it.  This was generated by the unmodified dumpfs
currently in our source tree.

| NetBSD 1.5 only supported file systems on disks where the physical sector
| size == DEV_BSIZE. So anything that uses DEV_BSIZE and is stored on-disk
| should really use the fs-declared sector size (FFS_DEV_BSHIFT() above,
| though Trevin used a different name).  When formatting a file system,
| "FFS_DEV_BSHIFT()" will equal the sector size of the media. That fs can
| later be moved around, because the superblock has enough info to recreate
| "FFS_DEV_BSHIFT()".

I agree that normally, when creating a file system, FFS_DEV_BSHIFT
will correspond to the sector size of the media, but it may no
longer match once the file system has been moved to other media.

I think most current uses of DEV_BSIZE need to be examined
to determine whether they should use FFS_DEV_BSIZE, d_secsize,
or a DEV_BSIZE constant of 512.
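
As a worked example of how the three units differ, the sketch below
applies the macros quoted above to two superblocks.  The first uses the
values from the NeXTstep 3.3 CD dump at the end of this mail (fsize
2048, shift 11, fsbtodb 0).  The second is an assumed case of 1024 byte
fragments created on 512 byte sectors.  The struct fs_sketch and the
function fs_dev_bsize are illustrative names only; the real struct fs
has many more members.

```c
#include <assert.h>
#include <stdint.h>

/* Compiled-in kernel unit; under this proposal it is always 512. */
#define DEV_BSIZE	512

/* Cut-down stand-in for the superblock: only the two fields used here. */
struct fs_sketch {
	int32_t fs_fshift;	/* log2 of the fragment size in bytes */
	int32_t fs_fsbtodb;	/* shift from fragments to device blocks */
};

#define FFS_DEV_BSHIFT(fs)	((fs)->fs_fshift - (fs)->fs_fsbtodb)
#define ffs_btodb(fs, b)	((b) >> FFS_DEV_BSHIFT(fs))
#define ffs_dbtob(fs, db)	((db) << FFS_DEV_BSHIFT(fs))
#define FFS_DEV_BSIZE(fs)	ffs_dbtob(fs, 1)

/* From the NeXTstep 3.3 CD dump: fsize 2048 (shift 11), fsbtodb 0. */
static const struct fs_sketch ns33cd = { 11, 0 };

/* Assumed example: 1024 byte fragments made on 512 byte sectors. */
static const struct fs_sketch disk512 = { 10, 1 };

/* Byte size of the device block that the file system was built on. */
static int64_t
fs_dev_bsize(const struct fs_sketch *fs)
{
	int64_t size = FFS_DEV_BSIZE(fs);

	assert(size > 0);
	return size;
}
```

Here fs_dev_bsize(&ns33cd) is 2048 and fs_dev_bsize(&disk512) is 512,
while the fragment sizes are 2048 and 1024 and DEV_BSIZE is 512 in
both cases.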

| > Do you agree with my list of independent sources of block
| > size?  Are there any other fundamental ones not derived
| > from the above three?  Should we create a list of derived
| > indications of block size and which fundamental block
| > size they should be derived from?
| 
| There actually is one more. The buffer cache is kept in units of
| DEV_BSIZE. You can have a file system that was made with a DEV_BSIZE=1024
| get moved to a kernel with DEV_BSIZE=512. After these changes, we want
| that fs to work. So that means that when translating ffs_btodb() outputs
| to buffer cache offsets, we need to use a conversion to bridge between
| them. :-)

I think this is the case where I discuss modifying the hardware device
strategy routines above.

Thanks,
Darrin

As I mentioned, here is the partial dumpfs output from a
NeXTstep 3.3 operating system distribution CD:

# dumpfs ns33cd.ufs | head -22
file system: ns33cd.ufs
endian  big-endian
magic   11954   time    Sat Nov 12 00:44:21 1994
id      [ 0 0 ]
cylgrp  static  inodes  4.2/4.3BSD      fslevel 0       softdep disabled
nbfree  1406    ndir    3168    nifree  71290   nffree  51
ncg     45      ncyl    89      size    182272  blocks  176323
bsize   8192    shift   13      mask    0xffffe000
fsize   2048    shift   11      mask    0xfffff800
frag    4       shift   2       fsbtodb 0
cpg     2       bpg     1024    fpg     4096    ipg     1984
minfree 10%     optim   time    maxcontig 20000 maxbpg  512
rotdelay 0ms    rps     5
ntrak   32      nsect   64      npsect  0       spc     2048
symlinklen -1   trackskew 0     interleave 0    contigsumsize -1
maxfilesize 0xffffffffffffffff
nindir  2048    inopb   64      nspf    1
avgfilesize -1  avgfpdir -1
sblkno  8       cblkno  12      iblkno  16      dblkno  140
sbsize  2048    cgsize  2048    offset  64      mask    0xffffffe0
csaddr  140     cssize  2048    shift   9       mask    0xfffffe00
cgrotor 42      fmod    0       ronly   0       clean   0x01