tech-kern archive
Re: LFS thoughts
On Tue, 26 Aug 2025, David Holland wrote:
> On Fri, Aug 22, 2025 at 06:09:31PM -0700, Konrad Schroder wrote:
> > [...] These blocks almost immediately
> > become stale again, leaving the newly compacted segment looking as if it
> > needs cleaning again.
> [...]
> Doing anything about it is going to be hard, though. On the one hand,
> the idea that the segments are in any kind of order and that it
> matters where new segment data gets written is a fiction, and we could
> probably clean out the last remnants of pretending it matters.
> On the other hand, all of this is going to make recovery a lot harder.
I'm not suggesting we get rid of partial-segment sequencing; that would be
a much more radical change (maybe ending up with something like WAFL). If
we didn't have serial numbers, recovery would be, yes, very difficult.
> The current LFS code is pretty vague on the concept of filesystem
> transactions and already doesn't really handle multiple-metadata
> operations ("dirops") well (this is something I had been meaning to
> work on)... basically as things stand you need all the blocks for a
> single operation to end up in the same segment so that every segment
> is complete, and this is handled reasonably well for indirect blocks
> but a mess for anything that touches more than one inode. Adding more
> complexity to that tracking without cleaning it out thoroughly first
> seems like a bad plan.
Hm, this part of the design always made sense to me, though I'm sure there
is room for improvement. They don't really need to be in the same
segment. Directory operations are written into partial-segments marked
with SS_DIROP and, if the data extends to the next partial-segment because
the whole set wouldn't fit in a single segment, with SS_CONT. A sequence
of partial-segments marked SS_DIROP is valid for roll-forward iff the last
partial-segment in the group is not also marked SS_CONT. For this to
work, the roll-forward agent needs to process partial-segments in serial
number order, including skipping to the next-written segment when it
reaches the end of a segment; but the filesystem will be consistent
whether the last group of partial-segments is accepted or rejected. (It
would, of course, be possible to write each directory operation into its
own partial-segment, but that would have performance ramifications: at
very least, one dirop write would have to be queued before a second could
begin, and segments containing many dirops would also have many
partial-segment headers, so they would be cleaned soon after they were
created.)
The "ibis" approach of duplicating a static Ifile would hardly affect data
roll-forward at all. The roll-forward agent would check which of
the two superblocks has the lower serial number, and use the Ifile pointed
to by that superblock, just as we do now. That older checkpoint is known
to be complete, and therefore consistent. Roll forward would proceed from
there. The only extra step would be copying the selected Ifile back into
the other location. (I should, however, point out that ibis is
incompatible with keeping the inodes in the ifile: changes to file data
and length would be recoverable by roll forward but changes to other inode
attributes would be lost.)
The "orthos" approach of keeping the ifile (not including inodes) in its
own separate log would also make little difference to roll forward,
because the ifile segments would never show up in the roll-forward agent's
view of partial-segments to be processed. If inodes *were* contained in
the ifile log, roll forward would need to read through both logs together,
in lock step, to pick up inode changes; that would complicate things a
bit, yes, but it could be ameliorated by keeping their serial numbers in
sync.
> [...] But even to the extent that's feasible to
> implement it's going to generate bazillions of little partial ifile
> segments, and that doesn't seem like a great idea. However, anything
> other than a 1-1 correspondence is going to incur a lot of on-disk
> complexity that seems like it would require major format changes.
The more I think about this the less convinced I am that putting inodes
into the Ifile is a good idea. We'd want to retain the ability to recover
from the log, which would mean more frequent ifile writes, which might
lose the cleaning efficiency improvement we'd hoped to gain by moving the
inodes out of the data segments.
> I suppose one could also just entirely drop the ability to roll
> forward from a checkpoint, but that also doesn't seem terribly
> desirable.
No, we don't want to lose that.
> For a _separate_ ifile ("Ibis") you'd have to reconstruct the ifile
> during roll-forward by scanning each segment. That might be possible
> (I forget to what extent the current metadata supports that, but it'd
> at most require minor format changes) and with a reasonable checkpoint
> frequency it shouldn't be that expensive.
Yes, this is what we do now (or more correctly, "how it is supposed to
work now" since we've basically never had a working roll-forward). No
format changes are required for this.
> However, this scheme does
> require writing out the whole ifile twice for every checkpoint and
> on what constitute reasonable-size volumes these days that'd be
> hundreds of megabytes. That seems like a stopper.
You'd only need to write the dirty blocks of the ifile; just to both
locations (write all dirty ifile blocks to location 0, wait for
completion, update superblock 0; write all dirty blocks to location 1,
wait for completion, update superblock 1). It would definitely increase
the time required to make a checkpoint, but not as much as writing the
whole file twice every time. It would be useful to do this all
asynchronously, which would mean one of: forbidding changes to any of the
dirty blocks while any are still in transit; keeping a second copy of all
the written blocks in memory all that time (presumably reserving memory
for that purpose to avoid deadlock); or allowing the checkpoint code to
read the written data back from location 0 to write it to location 1.
[...]
> As I've said before, I think the part of the cleaner that cleans a
> segment should be in the kernel, for both robustness and performance
> reasons. The part that decides which segments to clean when can and
> should stay in userland.
I agree we should have the ability for userland to say to the kernel,
"clean these segments" or "defragment that file", and there are quite a
lot of different policies that could be used. This is why I always
advocated for a userland cleaner. But if those two functions were in the
kernel already, it wouldn't take much to add a simple in-kernel cleaner
that a userland cleaner could turn off if it wanted to impose a cleaning
policy other than the default.
I had some [ideas] but it's going to take a while to page them in...
Thanks,
Konrad Schroder
perseant%hhhh.org@localhost