netbsd-ports: Re: Defragmenting a NetBSD drive

Subject: Re: Defragmenting a NetBSD drive
To: Guenther Grau , Todd Whitesel <toddpw@best.com>
From: Robert Elz <kre@munnari.OZ.AU>
List: netbsd-ports
Date: 09/16/1999 22:19:39
    Date:        Thu, 16 Sep 1999 11:58:18 +0200
    From:        Guenther Grau <Guenther.Grau@bosch.com>
    Message-ID:  <37E0BF3A.24E35B@bosch.com>

  | Hmm, I don't quite understand that. If an application is scanning a
  | directory while an entry (yes, I this time I got it right :-) is
  | deleted from/added to it, the directory structure is changed behind
  | the application. Why does it make a difference to application if the
  | creat() code does the compaction instead of the unlink() code?

Now, the entry for a file that is in the process of being deleted or
created has a chance of being missed in a scan (if it is created just
after) or seen when it is no longer there (if it was removed just after)
but the rest of the directory is safe - all files that weren't being
touched will appear correctly.

As soon as you start moving about the entries for other files, that won't
be true any more.   You could have a file that had been sitting there
untouched for months, that this week just misses being included in the
locate database, for no better reason than that some other file was being
added or deleted at just the wrong instant.   I hate to think what the
effects might be on spool type directories (mqueue etc) which are constantly
being scanned while files are being added and deleted.

Of course, the directory semantics could be altered to allow this to be
OK, such that a scan would need to be aborted, and restart on a directory
if the directory is altered - find, tar, ls, ... could all be taught to
check this.   Is it really worth it?

  | Well not  but at unlink() time. :-) Right now it happens
  | "arbitrarily" as creat() time ;-)

The only thing that happens then is that some empty directory blocks
might be lopped off the end of the directory.   They contain nothing.
Whether a scan gets to read them or not shouldn't normally matter...

Todd Whitesel <toddpw@best.com> said:
  | Hmm, this would mean that 'rm *' in a directory is O(N^2) in the
  | number of directory entries, because the later unlink()s have to scan
  | through lots of empty directory entries before they get to the next
  | undeleted file.

No, rm * is O(n).   What you're concerned with is the constant that
affects the operation.

On the other hand, rm -r would be, of itself O(n^2) (with a much
smaller constant) if it weren't for the per process namei cache that Kirk
McKusick added.

That is, in rm * the files are being deleted in essentially random order
as far as the directory is concerned.   To delete each file takes (on
average) a scan of one half the length of the directory.  Since the length
of a directory doesn't change while files are being deleted, that is a
constant - the time taken scales linearly with the number of files deleted
or with the size of the directory.   (It is actually slightly more complex
than that, as empty adjacent directory slots are combined into one, so things
actually do get a little faster as there are less files in the directory -
just not quite as much faster as you might hope).

rm -r on the other hand deletes the files in directory order.   If each
delete had to start at the front of the directory, and read until it
found the file, that would be an O(n^2) operation - later files take much
longer to delete than the earlier ones.   But McKusick's per process
namei cache avoids that by starting each scan where the previous one
left off.   For rm -r (or the INN's fastrm which was created precisely to
exploit this characteristic) eachfile removed is just after the previous
one, so is found very quickly (the operation is actually O(s) on the
size of the directory, rather than the number of files being removed).
This makes no difference to 'rm *' and other random operations, as by
definition, with random input, it doesn't matter where you start, it
will always take (on average) a half directory scan.

  | Could the ffs code be made to cope with directories that get truncated
  | from the _front_ even if that means they're technically sparse files?
  | Then unlink() could trim blocks at the front of directories when they
  | empty out, thus skipping to the meat of the directory much faster.

It could, though last time I looked, kernels panic'd when faced with holy
directories.   That isn't supposed to happen.   That could be fixed with
little pain however.   However before doing so, you really need to determine
if that small effort (and lost diagnostic ability) would give enough gain
to be worth the bother.   This means a comprehensive examination of just
how directories actually behave in real systems, and whether enough cases
ever actually occur where you would get any real win out of doing this to
justify the cost.   That means directories that grow lots of files which
are slowly removed (and not replaced) in approximately the order in which
they're added.   If the concern is 'rm *' then 'rm -r .' gets you the
per process namei cache win, which is much better, so there's no win there.
For other cases, I can't really think of a directory which grows lots of
files, which are then all removed more or less in order (with no new ones
being created which would require that any holes made be filled again).

  | Seems to me that an atomic operation could be provided that does some
  | amount of O(1) garbage collection (like finding the first empty slot
  | in a directory, and moving the first non-empty slot into it) and
  | returns a boolean to let you know if you should call again or move on.

Sure - but it is a change of directory semantic to the applications.

  | While I agree that the kernel deciding to do it is infinitely worse, I
  | don't really agree that it's OK because a human decided to do it. As a
  | programmer I would much rather have some guarantees about not seeing
  | the same file twice. I seem to recall having problems with 'ps' not
  | giving real snapshots of the process table either.

Sure.   But what you're discussing is what the semantics ought to be.
That's a fine discussion to have.   Right now what concerns me more is
that no-one goes altering the semantics that we have (attempting to optimise
something which very probably doesn't need much optimisation).

There's no question but there are race conditions around, but they're all
ones which have been around since day 1, and which we have become accustomed
to.   That doesn't mean they're good - but when they're a problem, they
can generally be avoided.

kre