
Re: LFS thoughts



On Tue, 26 Aug 2025, David Holland wrote:

> On Fri, Aug 22, 2025 at 06:09:31PM -0700, Konrad Schroder wrote:
> > [...] These blocks almost immediately
> > become stale again, leaving the newly compacted segment looking as if it
> > needs cleaning again.

> [...]
> Doing anything about it is going to be hard, though. On the one hand,
> the idea that the segments are in any kind of order and that it
> matters where new segment data gets written is a fiction, and we could
> probably clean out the last remnants of pretending it matters.
>
> On the other hand, all of this is going to make recovery a lot harder.
I'm not suggesting we get rid of partial-segment sequencing; that would be a much more radical change (maybe ending up with something like WAFL). If we didn't have serial numbers, recovery would be, yes, very difficult.

> The current LFS code is pretty vague on the concept of filesystem
> transactions and already doesn't really handle multiple-metadata
> operations ("dirops") well (this is something I had been meaning to
> work on)... basically as things stand you need all the blocks for a
> single operation to end up in the same segment so that every segment
> is complete, and this is handled reasonably well for indirect blocks
> but a mess for anything that touches more than one inode. Adding more
> complexity to that tracking without cleaning it out thoroughly first
> seems like a bad plan.

Hm, this part of the design always made sense to me, though I'm sure there is room for improvement. The blocks of an operation don't really need to be in the same segment. Directory operations are written into partial-segments marked with SS_DIROP and, if the data extends into the next partial-segment because the whole set wouldn't fit in a single segment, also with SS_CONT. A sequence of partial-segments marked SS_DIROP is valid for roll-forward iff the last partial-segment in the group is not also marked SS_CONT. For this to work, the roll-forward agent needs to process partial-segments in serial-number order, including skipping to the next-written segment when it reaches the end of a segment; but the result is consistent whether the last group of partial-segments is accepted or rejected.

(It would, of course, be possible to write each directory operation into its own partial-segment, but that would have performance ramifications: at the very least, one dirop write would have to be queued before a second could begin, and segments containing many dirops would also contain many partial-segment headers, so they would be cleaned soon after they were created.)
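
In code, the acceptance rule looks something like the following sketch (the struct, flag values, and helper names are made up for illustration; the real roll-forward agent reads summary blocks from disk rather than an array):

#include <stddef.h>
#include <stdint.h>

#define SS_DIROP 0x01  /* pseg contains dirop blocks (value illustrative) */
#define SS_CONT  0x02  /* dirop set continues in the next pseg */

struct pseg {
    uint64_t p_serial;  /* partial-segment serial number */
    uint32_t p_flags;   /* SS_DIROP, SS_CONT */
};

void replay_pseg(const struct pseg *);  /* hypothetical replay hook */

/*
 * 'ps' holds the partial-segments written since the last checkpoint,
 * already in serial-number order -- i.e. the collector has skipped to
 * the next-written segment whenever it hit the end of a segment.
 */
void
replay_all(const struct pseg *ps, size_t n)
{
    size_t i = 0, j;

    while (i < n) {
        if ((ps[i].p_flags & SS_DIROP) == 0) {
            replay_pseg(&ps[i++]);  /* ordinary partial-segment */
            continue;
        }
        /* Scan forward to the end of the SS_DIROP group. */
        for (j = i; j < n && (ps[j].p_flags & SS_CONT); j++)
            continue;
        if (j == n)   /* log ends with SS_CONT still set: */
            break;    /* incomplete group, reject it and stop */
        while (i <= j)            /* complete group: */
            replay_pseg(&ps[i++]); /* replay it as a unit */
    }
}

Either way out of the loop is consistent: a complete group replays as a unit, and an incomplete tail group is discarded wholesale.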

The "ibis" approach of duplicating a static Ifile would hardly affect data roll forward at all. The roll-forward agent would check which of the two superblocks has the lower serial number, and use the Ifile pointed to by that superblock, just as we do now. That older checkpoint is known to be complete, and therefore consistent; roll forward would proceed from there. The only extra step would be copying the selected Ifile back into the other location. (I should, however, point out that ibis is incompatible with keeping the inodes in the Ifile: changes to file data and length would be recoverable by roll forward, but changes to other inode attributes would be lost.)
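
As a sketch, checkpoint selection under ibis would look something like this (struct and field names hypothetical):

#include <stdint.h>

struct sketch_sb {
    uint64_t sb_serial;       /* checkpoint serial number */
    uint64_t sb_ifile_daddr;  /* disk address of this checkpoint's Ifile */
};

/*
 * Use the superblock with the LOWER serial number: its Ifile copy was
 * finished before the other was begun, so it is complete and
 * consistent.  Roll forward starts from that checkpoint; afterward the
 * selected Ifile gets copied back into the other location.
 */
static const struct sketch_sb *
ibis_pick(const struct sketch_sb *sb0, const struct sketch_sb *sb1)
{
    return (sb0->sb_serial <= sb1->sb_serial) ? sb0 : sb1;
}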

The "orthos" approach of keeping the ifile (not including inodes) in its own separate log would also make little difference to roll forward, because the ifile segments would never show up in the roll-forward agent's view of partial-segments to be processed. If inodes *were* contained in the ifile log, roll forward would need to read through both logs together, in lock step, to pick up inode changes; that would complicate things a bit, yes, but it could be ameliorated by keeping their serial numbers in sync.
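
Reading the two logs in lock step would essentially be a two-way merge on serial number; a sketch, again with made-up names:

#include <stddef.h>
#include <stdint.h>

struct pseg {               /* same illustrative struct as above */
    uint64_t p_serial;
    uint32_t p_flags;
};

void replay_data_pseg(const struct pseg *);   /* hypothetical */
void replay_ifile_pseg(const struct pseg *);  /* hypothetical */

/*
 * Replay both logs in one global serial-number order, so inode
 * changes from the Ifile log interleave correctly with the blocks
 * in the data log.
 */
void
replay_lockstep(const struct pseg *d, size_t nd,
    const struct pseg *f, size_t nf)
{
    size_t i = 0, j = 0;

    while (i < nd || j < nf) {
        if (j == nf || (i < nd && d[i].p_serial <= f[j].p_serial))
            replay_data_pseg(&d[i++]);
        else
            replay_ifile_pseg(&f[j++]);
    }
}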

> [...] But even to the extent that's feasible to
> implement it's going to generate bazillions of little partial ifile
> segments, and that doesn't seem like a great idea. However, anything
> other than a 1-1 correspondence is going to incur a lot of on-disk
> complexity that seems like it would require major format changes.

The more I think about this, the less convinced I am that putting inodes into the Ifile is a good idea. We'd want to retain the ability to recover from the log, which would mean more frequent Ifile writes, and that might cancel out the cleaning-efficiency gain we'd hoped for from moving the inodes out of the data segments.

> I suppose one could also just entirely drop the ability to roll
> forward from a checkpoint, but that also doesn't seem terribly
> desirable.

No, we don't want to lose that.

> For a _separate_ ifile ("Ibis") you'd have to reconstruct the ifile
> during roll-forward by scanning each segment. That might be possible
> (I forget to what extent the current metadata supports that, but it'd
> at most require minor format changes) and with a reasonable checkpoint
> frequency it shouldn't be that expensive.

Yes, this is what we do now (or more correctly, "how it is supposed to work now" since we've basically never had a working roll-forward). No format changes are required for this.

> However, this scheme does
> require writing out the whole ifile twice for every checkpoint and
> on what constitute reasonable-size volumes these days that'd be
> hundreds of megabytes. That seems like a stopper.

You'd only need to write the dirty blocks of the Ifile, just to both locations: write all dirty Ifile blocks to location 0, wait for completion, and update superblock 0; then write the same blocks to location 1, wait for completion, and update superblock 1. It would definitely increase the time required to take a checkpoint, but not by as much as writing the whole file twice every time. It would be useful to do all of this asynchronously, which would mean one of three things: forbidding changes to any of the dirty blocks while any are still in transit; keeping a second copy of all the written blocks in memory for the duration (presumably with memory reserved for the purpose, to avoid deadlock); or allowing the checkpoint code to read the written data back from location 0 in order to write it to location 1.
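
Concretely, the synchronous version is just this (the write/wait/superblock interfaces below are stand-ins, not the real buffer-cache calls):

#include <stddef.h>

struct buf;                               /* opaque dirty Ifile block */

void write_block(struct buf *, int loc);  /* queue block at location 'loc' */
void wait_for_writes(void);               /* barrier: all queued writes done */
void write_superblock(int which);

enum { LOCATION0, LOCATION1 };

/*
 * Only the dirty Ifile blocks are written, but to both locations;
 * neither superblock is updated until the copy it describes is
 * known to be on disk.
 */
void
ibis_checkpoint(struct buf **dirty, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        write_block(dirty[i], LOCATION0);
    wait_for_writes();
    write_superblock(0);

    for (i = 0; i < n; i++)
        write_block(dirty[i], LOCATION1);
    wait_for_writes();
    write_superblock(1);
}

The three asynchronous variants then differ only in how they guarantee that the blocks handed to the second loop are still the ones superblock 0 described.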

> [...]
> As I've said before, I think the part of the cleaner that cleans a
> segment should be in the kernel, for both robustness and performance
> reasons. The part that decides which segments to clean when can and
> should stay in userland.

I agree we should have the ability for userland to say to the kernel, "clean these segments" or "defragment that file", and there are quite a lot of different policies that could be used. This is why I always advocated for a userland cleaner. But if those two functions were in the kernel already, it wouldn't take much to add a simple in-kernel cleaner that a userland cleaner could turn off if it wanted to impose a cleaning policy other than the default.
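
As a sketch of that split (every name here is hypothetical; this is not the existing cleaner interface):

#include <stdint.h>

struct lfs;

/* Kernel mechanism: rewrite the live blocks out of one segment. */
int lfs_clean_segment(struct lfs *, uint32_t segnum);

/* Policy hooks, also hypothetical. */
int lfs_userland_cleaner_active(struct lfs *);
int lfs_pick_emptiest(struct lfs *, uint32_t *segnump); /* 0 if one found */

/*
 * Simple in-kernel default policy: clean the emptiest segments, but
 * stand down entirely whenever a userland cleaner has claimed the
 * filesystem in order to impose its own policy.
 */
void
lfs_cleaner_default(struct lfs *fs)
{
    uint32_t seg;

    if (lfs_userland_cleaner_active(fs))
        return;
    while (lfs_pick_emptiest(fs, &seg) == 0)
        lfs_clean_segment(fs, seg);
}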

I had some [ideas] but it's going to take a while to page them in...

Thanks,
						Konrad Schroder
						perseant@hhhh.org



