
Re: LFS thoughts



On Tue, 26 Aug 2025, David Holland wrote:

> On Fri, Aug 22, 2025 at 06:09:31PM -0700, Konrad Schroder wrote:
> > [...] These blocks almost immediately
> > become stale again, leaving the newly compacted segment looking as if it
> > needs cleaning again.

> [...]
> Doing anything about it is going to be hard, though. On the one hand,
> the idea that the segments are in any kind of order and that it
> matters where new segment data gets written is a fiction, and we could
> probably clean out the last remnants of pretending it matters.
>
> On the other hand, all of this is going to make recovery a lot harder.
I'm not suggesting we get rid of partial-segment sequencing; that would be a much more radical change (maybe ending up with something like WAFL). If we didn't have serial numbers, recovery would be, yes, very difficult.

> The current LFS code is pretty vague on the concept of filesystem
> transactions and already doesn't really handle multiple-metadata
> operations ("dirops") well (this is something I had been meaning to
> work on)... basically as things stand you need all the blocks for a
> single operation to end up in the same segment so that every segment
> is complete, and this is handled reasonably well for indirect blocks
> but a mess for anything that touches more than one inode. Adding more
> complexity to that tracking without cleaning it out thoroughly first
> seems like a bad plan.

Hm, this part of the design always made sense to me, though I'm sure there is room for improvement. The blocks of an operation don't really need to be in the same segment. Directory operations are written into partial-segments marked with SS_DIROP and, if the data extends into the next partial-segment because the whole set wouldn't fit in a single segment, also with SS_CONT. A sequence of partial-segments marked SS_DIROP is valid for roll-forward iff the last partial-segment in the group is not also marked SS_CONT. For this to work, the roll-forward agent needs to process partial-segments in serial-number order, including skipping to the next-written segment when it reaches the end of a segment; but the result is consistent whether the last group of partial-segments is accepted or rejected.

(It would, of course, be possible to write each directory operation into its own partial-segment, but that would have performance ramifications: at the very least, one dirop write would have to be queued before a second could begin, and segments containing many dirops would also contain many partial-segment headers, so they would be cleaned soon after they were created.)
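
In code, the acceptance rule looks something like the following sketch (the struct, flag values, and helper names are made up for illustration; the real roll-forward agent reads summary blocks from disk rather than an array):

#include <stddef.h>
#include <stdint.h>

#define SS_DIROP 0x01  /* pseg contains dirop blocks (value illustrative) */
#define SS_CONT  0x02  /* dirop set continues in the next pseg */

struct pseg {
    uint64_t p_serial;  /* partial-segment serial number */
    uint32_t p_flags;   /* SS_DIROP, SS_CONT */
};

void replay_pseg(const struct pseg *);  /* hypothetical replay hook */

/*
 * 'ps' holds the partial-segments written since the last checkpoint,
 * already in serial-number order -- i.e. the collector has skipped to
 * the next-written segment whenever it hit the end of a segment.
 */
void
replay_all(const struct pseg *ps, size_t n)
{
    size_t i = 0, j;

    while (i < n) {
        if ((ps[i].p_flags & SS_DIROP) == 0) {
            replay_pseg(&ps[i++]);  /* ordinary partial-segment */
            continue;
        }
        /* Scan forward to the end of the SS_DIROP group. */
        for (j = i; j < n && (ps[j].p_flags & SS_CONT); j++)
            continue;
        if (j == n)   /* log ends with SS_CONT still set: */
            break;    /* incomplete group, reject it and stop */
        while (i <= j)            /* complete group: */
            replay_pseg(&ps[i++]); /* replay it as a unit */
    }
}

Either way out of the loop is consistent: a complete group replays as a unit, and an incomplete tail group is discarded wholesale.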

The "ibis" approach of duplicating a static Ifile would hardly affect data roll forward at all. The roll-forward agent would check which of the two superblocks has the lower serial number, and use the Ifile pointed to by that superblock, just as we do now. That older checkpoint is known to be complete, and therefore consistent; roll forward would proceed from there. The only extra step would be copying the selected Ifile back into the other location. (I should, however, point out that ibis is incompatible with keeping the inodes in the Ifile: changes to file data and length would be recoverable by roll forward, but changes to other inode attributes would be lost.)
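
As a sketch, checkpoint selection under ibis would look something like this (struct and field names hypothetical):

#include <stdint.h>

struct sketch_sb {
    uint64_t sb_serial;       /* checkpoint serial number */
    uint64_t sb_ifile_daddr;  /* disk address of this checkpoint's Ifile */
};

/*
 * Use the superblock with the LOWER serial number: its Ifile copy was
 * finished before the other was begun, so it is complete and
 * consistent.  Roll forward starts from that checkpoint; afterward the
 * selected Ifile gets copied back into the other location.
 */
static const struct sketch_sb *
ibis_pick(const struct sketch_sb *sb0, const struct sketch_sb *sb1)
{
    return (sb0->sb_serial <= sb1->sb_serial) ? sb0 : sb1;
}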

The "orthos" approach of keeping the ifile (not including inodes) in its own separate log would also make little difference to roll forward, because the ifile segments would never show up in the roll-forward agent's view of partial-segments to be processed. If inodes *were* contained in the ifile log, roll forward would need to read through both logs together, in lock step, to pick up inode changes; that would complicate things a bit, yes, but it could be ameliorated by keeping their serial numbers in sync.
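
Reading the two logs in lock step would essentially be a two-way merge on serial number; a sketch, again with made-up names:

#include <stddef.h>
#include <stdint.h>

struct pseg {               /* same illustrative struct as above */
    uint64_t p_serial;
    uint32_t p_flags;
};

void replay_data_pseg(const struct pseg *);   /* hypothetical */
void replay_ifile_pseg(const struct pseg *);  /* hypothetical */

/*
 * Replay both logs in one global serial-number order, so inode
 * changes from the Ifile log interleave correctly with the blocks
 * in the data log.
 */
void
replay_lockstep(const struct pseg *d, size_t nd,
    const struct pseg *f, size_t nf)
{
    size_t i = 0, j = 0;

    while (i < nd || j < nf) {
        if (j == nf || (i < nd && d[i].p_serial <= f[j].p_serial))
            replay_data_pseg(&d[i++]);
        else
            replay_ifile_pseg(&f[j++]);
    }
}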

> [...] But even to the extent that's feasible to
> implement it's going to generate bazillions of little partial ifile
> segments, and that doesn't seem like a great idea. However, anything
> other than a 1-1 correspondence is going to incur a lot of on-disk
> complexity that seems like it would require major format changes.

The more I think about this, the less convinced I am that putting inodes into the Ifile is a good idea. We'd want to retain the ability to recover from the log, which would mean more frequent Ifile writes, and that might cancel out the cleaning-efficiency gain we'd hoped for from moving the inodes out of the data segments.

> I suppose one could also just entirely drop the ability to roll
> forward from a checkpoint, but that also doesn't seem terribly
> desirable.

No, we don't want to lose that.

> For a _separate_ ifile ("Ibis") you'd have to reconstruct the ifile
> during roll-forward by scanning each segment. That might be possible
> (I forget to what extent the current metadata supports that, but it'd
> at most require minor format changes) and with a reasonable checkpoint
> frequency it shouldn't be that expensive.

Yes, this is what we do now (or more correctly, "how it is supposed to work now" since we've basically never had a working roll-forward). No format changes are required for this.

> However, this scheme does
> require writing out the whole ifile twice for every checkpoint and
> on what constitute reasonable-size volumes these days that'd be
> hundreds of megabytes. That seems like a stopper.

You'd only need to write the dirty blocks of the Ifile, just to both locations: write all dirty Ifile blocks to location 0, wait for completion, and update superblock 0; then write the same blocks to location 1, wait for completion, and update superblock 1. It would definitely increase the time required to take a checkpoint, but not by as much as writing the whole file twice every time. It would be useful to do all of this asynchronously, which would mean one of three things: forbidding changes to any of the dirty blocks while any are still in transit; keeping a second copy of all the written blocks in memory for the duration (presumably with memory reserved for the purpose, to avoid deadlock); or allowing the checkpoint code to read the written data back from location 0 in order to write it to location 1.
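
Concretely, the synchronous version is just this (the write/wait/superblock interfaces below are stand-ins, not the real buffer-cache calls):

#include <stddef.h>

struct buf;                               /* opaque dirty Ifile block */

void write_block(struct buf *, int loc);  /* queue block at location 'loc' */
void wait_for_writes(void);               /* barrier: all queued writes done */
void write_superblock(int which);

enum { LOCATION0, LOCATION1 };

/*
 * Only the dirty Ifile blocks are written, but to both locations;
 * neither superblock is updated until the copy it describes is
 * known to be on disk.
 */
void
ibis_checkpoint(struct buf **dirty, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        write_block(dirty[i], LOCATION0);
    wait_for_writes();
    write_superblock(0);

    for (i = 0; i < n; i++)
        write_block(dirty[i], LOCATION1);
    wait_for_writes();
    write_superblock(1);
}

The three asynchronous variants then differ only in how they guarantee that the blocks handed to the second loop are still the ones superblock 0 described.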

> [...]
> As I've said before, I think the part of the cleaner that cleans a
> segment should be in the kernel, for both robustness and performance
> reasons. The part that decides which segments to clean when can and
> should stay in userland.

I agree we should have the ability for userland to say to the kernel, "clean these segments" or "defragment that file", and there are quite a lot of different policies that could be used. This is why I always advocated for a userland cleaner. But if those two functions were in the kernel already, it wouldn't take much to add a simple in-kernel cleaner that a userland cleaner could turn off if it wanted to impose a cleaning policy other than the default.
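
As a sketch of that split (every name here is hypothetical; this is not the existing cleaner interface):

#include <stdint.h>

struct lfs;

/* Kernel mechanism: rewrite the live blocks out of one segment. */
int lfs_clean_segment(struct lfs *, uint32_t segnum);

/* Policy hooks, also hypothetical. */
int lfs_userland_cleaner_active(struct lfs *);
int lfs_pick_emptiest(struct lfs *, uint32_t *segnump); /* 0 if one found */

/*
 * Simple in-kernel default policy: clean the emptiest segments, but
 * stand down entirely whenever a userland cleaner has claimed the
 * filesystem in order to impose its own policy.
 */
void
lfs_cleaner_default(struct lfs *fs)
{
    uint32_t seg;

    if (lfs_userland_cleaner_active(fs))
        return;
    while (lfs_pick_emptiest(fs, &seg) == 0)
        lfs_clean_segment(fs, seg);
}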

I had some [ideas] but it's going to take a while to page them in...

Thanks,
						Konrad Schroder
						perseant@hhhh.org



