
Re: LFS thoughts



On Fri, Aug 22, 2025 at 06:09:31PM -0700, Konrad Schroder wrote:
 > I've been thinking about LFS off and on for a while now, and I'd like to
 > run a few of my thoughts by everyone else. Since the last time I looked
 > closely at the code base, there have been quite a number of improvements
 > by some very good people, but it still has some issues.
 > 
 > 1) The most vexing outstanding issue, in my mind, is the fact that the
 > cleaner often cannot improve the amount of available space on disk. This
 > is largely due to volatile metadata, in particular index file blocks,
 > being written into the segments while cleaning. These blocks almost
 > immediately become stale again, leaving the newly compacted segment
 > looking as if it needs cleaning again. (When the filesystem is empty,
 > this is not a big deal, but when it approaches full it's a killer.) The
 > same is true of inode blocks and indirect blocks, though to a lesser
 > extent. If the index file could be segregated from the regular file
 > data, it would help the situation immensely.

Hmm. I hadn't been aware that this was a problem, or rather, that this
is the state it gets into that causes the problem.

Doing anything about it is going to be hard, though. On the one hand,
the idea that the segments are in any kind of order and that it
matters where new segment data gets written is a fiction, and we could
probably clean out the last remnants of pretending it matters.

On the other hand, all of this is going to make recovery a lot harder.
The current LFS code is pretty vague on the concept of filesystem
transactions and already doesn't really handle multiple-metadata
operations ("dirops") well (this is something I had been meaning to
work on)... basically as things stand you need all the blocks for a
single operation to end up in the same segment so that every segment
is complete, and this is handled reasonably well for indirect blocks
but is a mess for anything that touches more than one inode. Adding more
complexity to that tracking without cleaning it out thoroughly first
seems like a bad plan.
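
(For reference, the invariant involved is roughly the following. The
SS_DIROP/SS_CONT flag names come, if memory serves, from the on-disk
segment summary in ufs/lfs/lfs.h; the structure and the check below are
simplified illustrative stand-ins, not the real roll-forward code:)

#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative sketch only.  SS_DIROP/SS_CONT are the summary flags
 * used to mark dirop writes; everything else here is made up.
 */
#define SS_DIROP 0x01   /* partial segment contains dirop blocks */
#define SS_CONT  0x02   /* dirop continues into the next partial segment */

struct summary_stub {
	uint16_t ss_flags;
};

/*
 * Roll-forward can take a partial segment on its own unless it holds
 * part of a dirop that spills into a later write, in which case the
 * pieces must be taken or dropped together.
 */
static bool
replayable(const struct summary_stub *ss, bool continuation_on_disk)
{
	if ((ss->ss_flags & SS_DIROP) == 0)
		return true;		/* no multi-inode operation here */
	if ((ss->ss_flags & SS_CONT) == 0)
		return true;		/* the dirop completes in this write */
	return continuation_on_disk;	/* all-or-nothing across the pieces */
}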

Once you start putting pieces of operations in multiple segments,
there has to be enough on-disk structure to keep track of which pairs
or groups of segments are required to be taken or discarded together.
If you always write an ifile segment and a data segment at the same
time, and make sure each has only and exactly the data corresponding
to the other one, it's probably sufficient to add some info to the
segment summaries so that if only one of the segments makes it out you
just drop the other. But even to the extent that's feasible to
implement, it's going to generate bazillions of little partial ifile
segments, and that doesn't seem like a great idea. However, anything
other than a 1-1 correspondence is going to incur a lot of on-disk
complexity that seems like it would require major format changes.
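
(Concretely, the 1-1 version might amount to stamping both summaries
with a shared serial number, something like this; both the field and
the check are entirely made up:)

#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical extension: each partial-segment summary records a
 * serial number shared with its companion write.  Neither the field
 * nor the check exists today; this is just the shape of the idea.
 */
struct summary_pair_ext {
	uint64_t ss_pair_serial;	/* shared by ifile and data writes */
};

/* Keep a pair during recovery only if both halves reached the disk. */
static bool
pair_complete(const struct summary_pair_ext *data_ss,
    const struct summary_pair_ext *ifile_ss)
{
	return data_ss != NULL && ifile_ss != NULL &&
	    data_ss->ss_pair_serial == ifile_ss->ss_pair_serial;
}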

I suppose one could also just entirely drop the ability to roll
forward from a checkpoint, but that also doesn't seem terribly
desirable.

For a _separate_ ifile ("Ibis") you'd have to reconstruct the ifile
during roll-forward by scanning each segment. That might be possible
(I forget to what extent the current metadata supports that, but it'd
at most require minor format changes) and with a reasonable checkpoint
frequency it shouldn't be that expensive. However, this scheme does
require writing out the whole ifile twice for every checkpoint, and
on what counts as a reasonably sized volume these days that'd be
hundreds of megabytes. That seems like a stopper.
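
(For scale: at a guessed 32 bytes per ifile entry, a volume with 16
million inodes carries a roughly 512 MB ifile, so each checkpoint would
push about a gigabyte of metadata to disk.)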

(I don't see any way to update a fixed ifile on the fly without some
way to journal the updates, which we don't have without major
changes.)
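
(To make "journal the updates" concrete, the missing piece would be
something like a record of each in-place ifile write, replayed during
recovery. The layout below is entirely hypothetical; nothing like it
exists in LFS today:)

#include <stdint.h>

/* Hypothetical journal record for updating a fixed-location ifile. */
struct ifile_jrec {
	uint64_t jr_offset;	/* byte offset of the update in the ifile */
	uint32_t jr_len;	/* length of the new data */
	uint32_t jr_cksum;	/* checksum over the record and data */
	/* jr_len bytes of new ifile content follow */
};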

And, I never did understand why the ifile is a file of inode locations
instead of a file of inodes. The current arrangement lets you write out
exactly the inodes that have changed, and not others that just happen
to be physically next to them. But it seems like there are other ways
to arrange that.
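
(Roughly, the difference between the two layouts is this. The first
struct is from memory of the IFILE entry in ufs/lfs/lfs.h, with
approximate field sizes; the second is the hypothetical alternative:)

#include <stdint.h>

/* Today: an ifile entry locates the inode. */
struct ifile_entry {
	uint32_t if_version;	/* inode version number */
	int64_t  if_daddr;	/* disk address of the inode's block */
	uint32_t if_nextfree;	/* next inumber on the free list */
};

/* Hypothetical alternative: the ifile holds the inodes themselves. */
struct dinode_stub {
	char di_bytes[128];	/* stand-in for the on-disk inode */
};

struct ifile_entry_alt {
	struct dinode_stub if_dinode;
	/*
	 * Writing one changed inode now dirties a whole ifile block,
	 * dragging unchanged neighbors out to disk with it.
	 */
};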

 > 2) Connecting dirty pages directly to buffer headers when writing might
 > be resulting in incorrect partial-segment checksums. I can't be sure
 > that that is the cause, but the checksums are definitely sometimes
 > incorrect even when the segments were written (as far as I can tell)
 > properly. This would interfere with roll-forward, but more importantly,
 > if the cleaner is paying attention to the checksums as it ought, then
 > those segments might become uncleanable. Before UBC, lfs_writeseg()
 > freed data buffers by copying their data into larger, pre-reserved
 > buffers before checksumming the lot and sending it to disk. This also
 > frees up the buffers/pages very quickly compared to waiting for the
 > disk, though of course at the expense of CPU and reserved memory.

I have no idea about this.

 > 3) Roll-forward and some form of cleaning should be moved in-kernel. I
 > already have code for in-kernel roll forward past the second checkpoint
 > that I need to dust off, test and commit. Cleaning is trickier because
 > an in-kernel cleaner would be less flexible, but the basic cleaning and
 > defragmenting functionality should be there.

As I've said before, I think the part of the cleaner that cleans a
segment should be in the kernel, for both robustness and performance
reasons. The part that decides which segments to clean, and when, can
and should stay in userland.
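
(The interface could stay narrow: the daemon computes a policy decision
and hands the kernel nothing but a segment number. A sketch, with a
made-up LFCNCLEANSEG standing in for whatever the real fcntl command
would be:)

#include <fcntl.h>
#include <stdint.h>

/* Hypothetical command: neither the name nor the number is real. */
#define LFCNCLEANSEG	0x80047f00

struct clean_request {
	uint32_t cr_segment;	/* segment chosen by userland policy */
};

/*
 * Cleaner-daemon side: policy (e.g. pick the segment with the fewest
 * live bytes in the segment-usage table) stays here; the kernel does
 * the mechanical work of rewriting the live blocks and marking the
 * segment clean.
 */
static int
clean_one(int fs_fd, uint32_t segno)
{
	struct clean_request cr = { .cr_segment = segno };

	return fcntl(fs_fd, LFCNCLEANSEG, &cr);
}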

 > There has been quite a lot of work on LFS in the last 20 years, some
 > with hints of a roadmap. Does anyone else have specific ideas about the
 > most glaring issues, or what should be done next?

I had some but it's going to take a while to page them in...

-- 
David A. Holland
dholland%netbsd.org@localhost

