tech-kern archive
Re: RAID stripe size
Date: Tue, 29 Apr 2025 12:01:05 -0400
From: Greg Troxel <gdt%lexort.com@localhost>
Message-ID: <rmi7c33554u.fsf%s1.lexort.com@localhost>
| Edgar Fuß <ef%math.uni-bonn.de@localhost> writes:
|
| > On a populated FFS, is there an easy way to determine how many
| > fragments are in use or how many blocks are split into fragments?
|
| dumpfs might help.
Not a lot, unfortunately. For this, all you get is the number of
free fragments. You could try to get some idea from the cylinder
group dumps, looking at the nffree values (the last of the 4 values),
but exactly how that relates to the number of used frags, I have
no idea. That says how many free frags there are in each CG, but not
how many blocks they are spread across (how many blocks have been
divided into fragments), which is what you would need to calculate
(from just that) the number of allocated fragments.
| My impression is that files are always the highest
| number of blocks that fit, and then fragments as needed.
That's correct for small files - ones which need no indirect blocks.
That is, any file no bigger than 12 blocks (UFS_NDADDR, from
<ufs/dinode.h>, for FFS filesystems anyway), where "block" means the
block size of the filesystem, might have fragments for the last block,
if it is not full.
No files bigger than that ever have fragments. The indirect blocks,
and all blocks reached via an indirect block, are all full blocks,
never a fragment. And fragments are only ever right at the end of
a small file; all the earlier blocks (even if mostly full of zeroes/holes)
are full blocks (unless the whole block is a hole, in which case it
is absent).
One could scan the inodes, looking for files with size < 12 * fs-blocksize,
and then calculate ((size % blocksize) + fragsize - 1) / fragsize
(integer division: the partial last block, rounded up to whole fragments)
to give the number of fragments that might be in use for that file.
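A minimal sketch of that calculation in C (the tail_frags() helper is
hypothetical, and UFS_NDADDR is hard-coded to the usual 12 rather than
taken from the header):

	#include <stdint.h>

	#define NDADDR	12	/* direct block pointers (UFS_NDADDR) */

	/*
	 * Fragments the last block of a file might occupy: zero for
	 * files big enough to need indirect blocks (those never use
	 * frags) and for files ending on a block boundary; otherwise
	 * the partial tail rounded up to whole fragments.
	 */
	static int64_t
	tail_frags(int64_t size, int64_t bsize, int64_t fsize)
	{
		int64_t tail = size % bsize;	/* bytes past last full block */

		if (size > NDADDR * bsize || tail == 0)
			return 0;
		return (tail + fsize - 1) / fsize;
	}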
| If the files are large, there shouldn't be that many fragments.
That's correct.
| fsck (-ny while mounted) reports, on an arbitrarily chosen filesystem:
For me, on a filesystem which has mostly large files (but some small ones):
34110 files, 2770021430 used, 405079512 free (1964 frags, 101269387 blocks, 0.0% fragmentation)
And another (much smaller) which has lots of smaller files (and some
bigger ones):
2366220 files, 122443842 used, 145327675 free (317787 frags, 18126236 blocks, 0.1% fragmentation)
Note that the numbers in parentheses are what is free; that's
exactly the same as dumpfs reports. I believe the 2nd and 3rd numbers
(used & free) are in units of the fragment size (the minimum allocation
size on the filesystem).
Perhaps surprisingly, the filesystem doesn't really bother keeping track
of how much of anything is allocated ... all that is just dead to the
filesystem; what it cares about is what is free (available for it to use).
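(As a quick check of those units: with the block/frag sizes given below,
32K/8K for the first filesystem and 16K/2K for the second, the free
counts add up exactly, in fragment-sized units:

	101269387 blocks * 4 frags/block + 1964 frags = 405079512
	 18126236 blocks * 8 frags/block + 317787 frags = 145327675

matching the "free" figures fsck printed.)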
| > Just in case: What this really is about is choosing the RAID stripe
| > unit size. On a three-component RAID 5, I think I basically have two
| > options: make a stripe the size of a fragment or make it the size of a
| > block. The data on the RAID set is a Borg backup, and I can examine
| > data on another backup server. Since backups are mostly written to, I
| > think I need to optimize for write speed.
|
| Usually block/fragment is 8k/1k, or 16k/2k.
For filesystems with large files, I tend to use 32K/4K or 32K/8K.
(4K fragments as a minimum make sense on most modern large drives,
particularly if write speed is important, so the drive does not need
to do a read-modify-write (RMW) internally to handle writes smaller
than its physical sector size.)
For the filesystems reported above, I used 32K/8K and 16K/2K.
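(If you want to set those explicitly, that's something like

	newfs -b 32768 -f 4096 /dev/rraid0a

where -b is the block size, -f the fragment size, and the device name
is just an example.)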
| I have the impression RAID5 stripe sizes tend to be more like 32k or 64k.
They can be whatever (power of two, perhaps) you want them to be,
within reason. I tend to make RAID L5 (with 3 drives) use the file
system block size / 2 (so one block is one stripe unit on each of the
2 drives holding data, plus one on whichever drive holds parity for
that stripe), and then raidframe doesn't need to do RMW operations on
the stripes.
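For example, with a 32K filesystem block on a 3 drive RAID 5 (2 data
plus 1 parity per stripe):

	SU = 32K / 2 = 16K

so a single-block write fills a whole stripe, and the parity can be
computed from the newly written data alone, with nothing to read back.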
If read speed is more important than write speed, then bigger stripes
make more sense.
The first filesystem above is on a Raid L5 (raidframe) which is configured
with 16K (32 sector) stripe units, so one filesystem full sized block
occupies exactly one full stripe (one SU on each of the 2 data drives,
plus parity). (The other is on a Raid L1 which also has 16K stripe
units, so there one filesystem block is one SU.)
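A raidframe configuration for that sort of setup might look something
like this (component names invented; 32 sectors per SU = 16K stripe
units; this follows the sample layout in raidctl(8)):

	START array
	# numRow numCol numSpare
	1 3 0

	START disks
	/dev/wd0a
	/dev/wd1a
	/dev/wd2a

	START layout
	# sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level
	32 1 1 5

	START queue
	fifo 100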
| I am unclear on if ffs tends to allocate consecutive blocks.
They used to, the old maxcontig parameter (-a option to newfs) specified
how many to attempt to use, I believe that has changed since I looked at
how FFS really worked last. I don't believe almost anyone ever used that
parameter, or attempted to work out what settings made sense in what
situations.
As for the underlying question: unless the vast majority of the files
to be on the filesystem are quite small (and remember, directories are
files too, though many symlinks no longer are), simply forget about
fragments (even make the frag size == block size if you like) and
calculate everything that matters based upon the block size. But unless
you're doing a lot of high speed writes, you probably won't detect any
real difference, whatever you set it to (ie: for benchmarking
performance it matters, for actual use it generally doesn't).
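(Making frag size == block size with newfs would just be something like
-b 32768 -f 32768.)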
kre