Swift Griggs <swiftgriggs%gmail.com@localhost> writes:
> I'm curious about something, probably due to ignorance of the full
> dynamics of the vfs(9) layer. Why is it that folks don't choose file
> system block sizes and partition offsets that are least-common-factors
> that they share with the hardware layer. Ie.. Let's say the hard disk
> uses 4K pages, the file system uses 8K blocks, and the vendor
> recommends that you stay aligned with a 1GB value. Wouldn't operating
> on 8K blocks still satisfy the underlying device (since 8K operations
> would always be divisible by a factor of 4K) and the 1GB alignment may
> not always be perfect, but the 8K ops below it would eventually stack
> to 1GB exactly, too.
Good questions, and it boils down to a few things:
  - many devices don't have a way to report their underlying block
    sizes.  For example, if you buy a 2T spinning disk, it will very
    likely be one that has sectors that are actually 4K but an interface
    of 512B sectors.  So if you read, it's fine because it gets a 4K
    sector into the cache, and then hands you the piece you want.  And
    when you write, if you write a 512-byte sector, it has to
    read-modify-write.  Worse, if you write 4K or 8K that is not lined up
    with the 4K sectors (which you will if your fs has 8K blocks but
    starts at sector 63), it has to read-modify-write 2 sectors per write.
  - SSDs are even harder to figure out, as Andreas's helpful references
    in response to my question show.
  - filesystems sometimes get moved around, and the layers higher up are
    even more disconnected from the actual hardware
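
To make the read-modify-write arithmetic concrete, here is a rough sketch
(not taken from any kernel code).  It assumes a 512B interface over 4K
physical sectors, and the function name sectors_touched is made up for
illustration.  It counts how many physical sectors one filesystem-block
write touches and how many of those are only partly overwritten, which
are the ones the drive has to read-modify-write.

```python
LOGICAL = 512      # sector size the interface exposes, in bytes
PHYSICAL = 4096    # assumed real sector size on the media, in bytes


def sectors_touched(start_lba, block_index, block_size):
    """Return (touched, partial) physical sectors for one fs-block write.

    start_lba   -- partition start, in 512-byte logical sectors
    block_index -- which filesystem block within the partition is written
    block_size  -- filesystem block size in bytes
    """
    first_byte = start_lba * LOGICAL + block_index * block_size
    end_byte = first_byte + block_size          # one past the last byte
    first_phys = first_byte // PHYSICAL
    last_phys = (end_byte - 1) // PHYSICAL
    touched = last_phys - first_phys + 1

    # A physical sector is only partly overwritten when the write starts
    # or ends in the middle of it; those sectors need read-modify-write.
    partial = 0
    if first_byte % PHYSICAL != 0:
        partial += 1
    if end_byte % PHYSICAL != 0:
        partial += 1
    # A small write inside a single sector counts that sector once.
    partial = min(partial, touched)
    return touched, partial


print(sectors_touched(63, 0, 8192))     # (3, 2): start at 63, 8K blocks
print(sectors_touched(2048, 0, 8192))   # (2, 0): start at 2048, 8K blocks
```

With the start at 63, every 8K block straddles three physical sectors and
two of them are partial; with the start at 2048 the same write covers
exactly two whole sectors.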
So there are two issues: alignment and filesystem block/frag size, and
both have to be ok.  For larger disks, UFS uses larger block sizes by
default (man newfs).  So that's ok, but alignment is messier.  We're
seeing smaller disks with 4K sectors or larger flash erase blocks and
512B interfaces now.
And, there are also disks with native 4K sectors, where the interface to
the computer transfers 4K chunks.  That avoids the alignment issue, but
requires filesystem/kernel support.  I am pretty sure netbsd-7 is ok
with that but I am not sure about earlier.
It would probably be possible to add a call into drivers to return this
info and propagate it up and have newfs/fdisk query it.  I am not sure
that all disks report the info reliably, and there are probably a lot
of details to sort out.  But it's more work and doesn't necessarily do
better than
"just start at 2048 and use big blocks".  Certainly you are welcome to
read the code and think about it if this interests you - just explaining
why I think no one has done the work so far.
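
As a rough illustration of why "start at 2048" tends to work without
asking the device anything: 2048 sectors of 512B is exactly 1 MiB, which
is a multiple of every power-of-two size up to 1 MiB.  The sketch below
checks a start sector against a list of candidate sizes.  The sizes are
just examples I picked, not values reported by any particular disk or
SSD, and misaligned_sizes is a made-up helper name.

```python
# Example physical-sector / erase-block sizes in bytes (assumptions, not
# taken from any specific device).
CANDIDATES = [4096, 8192, 65536, 131072, 1048576]


def misaligned_sizes(start_lba, candidate_sizes, logical=512):
    """Return the candidate sizes that the partition start is NOT a
    multiple of, given the start in logical sectors."""
    offset = start_lba * logical
    return [size for size in candidate_sizes if offset % size != 0]


print(misaligned_sizes(63, CANDIDATES))    # every candidate is misaligned
print(misaligned_sizes(2048, CANDIDATES))  # [] : aligned to all of them
```

A device with a larger unit than 1 MiB would still be misaligned, which
is part of why querying the hardware could in principle do better.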
> Is it all about waste at the file system layer due to some block
> operations being optimized for large devices and buffers but not being
> as applicable (or being downright wasteful) on smaller block devices?
I guess you can put it that way, saying otherwise we would always start
at 2048 and use 32K or even 64K blocks.   But I think part of it is
inertia.  And the 63 start dates back to floppies that had 63-sector
tracks - so it was actually aligned.