tech-kern archive


Re: posix_fallocate



    Date:        Sun, 17 Nov 2013 14:12:16 -0500 (EST)
    From:        Mouse <mouse%Rodents-Montreal.ORG@localhost>
    Message-ID:  <201311171912.OAA17183%Chip.Rodents-Montreal.ORG@localhost>

  | That's a reason to put it in the kernel, actually.  The kernel can tell
  | which ranges of a file have already had space allocated for them, so it
  | can be just writes.

Sure, but that's just an optimisation, and surely only matters if this
is done enough that the difference is actually of significance.  Do you
really believe it is, or ever will be?

  | Well, I don't know what the point of having posix_fallocate at all
  | would be.

Agree there, it seems mostly useless.

  | But the obvious answer to this is atomicity, especially in
  | the presence of other writers:

Really?  You're imagining multiple writers, writing to the same file,
without any co-ordination (like locking, or whatever other way works) and
you're actually worried about unpredictable results???   Really?

But beyond that, since you plan on allowing signals to interrupt the
operation part way through, what atomicity are you really getting anyway?

  | No, it doesn't mean that, strictly.  But any syscall that feels like it
  | can be signalable during any sleep involved in its operation;

Sure, though the internal implementation of the sys call has to explicitly
make that happen - and to be useful, the sys call interface really has to
be designed to support it.   This one isn't (and as specified, cannot be.)

That is, if you imagine the signal in question being SIGKILL (as in the
scenario I postulated initially) there's no problem - if the code for the
sys call wants to allow that to work before it completes, it easily can.

But instead imagine it is SIGALRM - now there's a problem: since there's
no way in the interface to report how much of the work was done, the only
thing that can be done if EINTR happens is to repeat the sys call.  For
a periodic signal like SIGALRM (or any of the other timer signals) chances
are that a signal will arrive during the sys call every time, resulting
in an infinite loop of fallocate/signal/fallocate/signal...
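
To make the retry problem concrete, here is a minimal sketch in C of the
only kind of loop a caller can write.  The helper name fallocate_retry and
the max_tries cap are assumptions added for illustration, not part of any
proposed interface.  The cap is there because, with a periodic timer firing
faster than the allocation can complete, an uncapped loop might never finish.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>

/*
 * Hypothetical helper: retry posix_fallocate() while it reports EINTR,
 * giving up after max_tries attempts.
 * posix_fallocate() returns an error number as its result, and nothing
 * in its interface says how far it got, so every retry has to request
 * the whole range again.
 */
int
fallocate_retry(int fd, off_t offset, off_t len, int max_tries)
{
	int error = EINTR;

	while (max_tries-- > 0) {
		error = posix_fallocate(fd, offset, len);
		if (error != EINTR)
			break;		/* success, or a real failure */
	}
	return error;			/* still EINTR if we gave up */
}
```

A caller might use fallocate_retry(fd, 0, size, 5) and treat a returned
EINTR as "gave up".  Blocking the timer signal around the call with
sigprocmask() would be another way out, at the cost of delaying the timer.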

This thing is just poorly designed.  Let's just ignore it (by all means,
implement the hole making part from the linux interface if desired, but
the allocation side just isn't needed).

  | Given what it does, a half-completed posix_fallocate is indistinguishable,
  | to userland, from a never-started posix_fallocate, provided the former
  | hasn't got as far as affecting st_size.

Of course, I'm only really interested in cases where the size can't help
being affected, as it starts at 0 - cases where actual holes in the middle
of a file are being filled in I see as so unlikely in practice that they're
totally irrelevant (unless the app has just made the file by seeking
forward and writing a byte, how does it ever know whether or not there are
holes to fill?  And why would it do it that way, and follow with fallocate()?
If fallocate() exists, surely it would just use that to make the file?)
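
For what it's worth, about the only widely available hint an application
gets about holes is comparing st_blocks with st_size.  A rough sketch in C
follows; looks_sparse is a hypothetical name, and the 512-byte unit for
st_blocks is an assumption that holds on most systems but is not promised
by POSIX.  It is only a heuristic, and it says nothing about where the
holes are.

```c
#include <sys/types.h>
#include <sys/stat.h>

/*
 * Hypothetical helper: guess whether the file behind fd has holes.
 * Returns 1 if it looks sparse, 0 if not, -1 if fstat() fails.
 * Assumes st_blocks counts 512-byte units (common, not guaranteed).
 */
int
looks_sparse(int fd)
{
	struct stat st;

	if (fstat(fd, &st) == -1)
		return -1;
	/* fewer bytes allocated than the file is long: probably holes */
	return ((off_t)st.st_blocks * 512 < st.st_size) ? 1 : 0;
}
```

File systems that count metadata blocks in st_blocks, or that compress
data, can make this guess wrong in either direction.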

But as above, that it is indistinguishable is the problem.

kre

ps: Note that I see that the linux way of handling fallocate isn't to
write blocks of 0's, but to allocate uninitialised blocks, and mark them
uninitialised - I assume the way that works, is that if an app reads one
of those blocks, it is just given 0's - and if it writes (the expected
operation) whatever is there (0's or junk) just gets overwritten.  That
way they make fallocate() really fast (just assigns block numbers to the file)
but it requires that the un-init flag (wherever, and however, they keep that)
is 100% reliable.   Nothing is really that reliable...   FFS doesn't have
any mechanism to do that, so actually writing 0's would be the only way,
and given that, fallocate() looks to be a total waste of time - again,
given that it is an optional sys call, that no-one is required to implement,
and so which no-one can assume actually exists.
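
As an illustration of the "actually writing 0's" approach, here is a minimal
userland sketch in C.  The name zero_fill and the 64KB buffer size are
assumptions made for the example.  It is only safe on a range that holds no
data yet (a newly created file, say), since it overwrites whatever is there.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: force allocation of [offset, offset + len) by
 * writing zeros.  Returns 0 on success, or an errno value on failure.
 */
int
zero_fill(int fd, off_t offset, off_t len)
{
	char buf[65536];
	ssize_t n;

	memset(buf, 0, sizeof(buf));
	while (len > 0) {
		size_t chunk = len < (off_t)sizeof(buf) ?
		    (size_t)len : sizeof(buf);

		n = pwrite(fd, buf, chunk, offset);
		if (n == -1) {
			if (errno == EINTR)
				continue;	/* nothing written, try again */
			return errno;
		}
		if (n == 0)
			return EIO;		/* no progress, give up */
		/* a short write is fine: resume from where it stopped */
		offset += n;
		len -= n;
	}
	return 0;
}
```

Unlike posix_fallocate(), pwrite() reports how much it wrote, so a signal
arriving part way through costs at most one chunk rather than the whole
request.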


