
Re: posix_fallocate



>> That's a reason to put it in the kernel, actually.  The kernel can
>> tell which ranges of a file have already had space allocated for
>> them, so it can be just writes.
> Sure, but that's just an optimisation, and surely only matters if
> this is done enough that the difference is actually of significance.
> Do you really believe it is, or ever will be?

Not just an optimization; it also affects correctness - see below.

>> But the obvious answer to this is atomicity, especially in the
>> presence of other writers:
> Really?  You're imagining multiple writers, writing to the same file,
> without any co-ordination (like locking, or whatever other way works)
> and you're actually worried about unpredictable results???  Really?

Actually, thinking about it more, atomicity is the wrong word.

The correct thing to worry about here is that, as I read the pointed-to
webpage, posix_fallocate() is defined to do nothing to already-present
data.  But, when racing with another writer, a non-kernel
implementation will always have conditions under which it can destroy
data written by some other process.
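To make that race concrete, here is a minimal sketch of one plausible
userland emulation strategy (purely illustrative - the function name,
block size, and read-a-byte-then-write-it-back approach are my
assumptions, not any particular implementation's code):

    #include <errno.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define BLKSZ 512   /* assumed block size, for illustration */

    /*
     * Force allocation of each block in [off, off+len) by reading one
     * byte and writing it back; the write-back is what makes the
     * filesystem allocate the block without (apparently) changing the
     * data.
     */
    static int
    fallocate_emul(int fd, off_t off, off_t len)
    {
        unsigned char byte;
        ssize_t n;
        off_t pos;

        for (pos = off; pos < off + len; pos += BLKSZ) {
            n = pread(fd, &byte, 1, pos);
            if (n < 0)
                return errno;
            if (n == 0)
                byte = 0;   /* past EOF: nothing to preserve */
            /*
             * Race window: if another process writes this byte right
             * here, the pwrite() below puts the stale byte back and
             * silently destroys that write - exactly the "do nothing
             * to already-present data" guarantee being violated.
             */
            if (pwrite(fd, &byte, 1, pos) < 0)
                return errno;
        }
        return 0;
    }

Whatever scheme a userland version uses, it has some window of that
shape; only the kernel, which can allocate blocks without rewriting
existing data, can close it.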

>> No, it doesn't mean that, strictly.  But any syscall that sleeps
>> feels like it should be signalable during any sleep involved in its
>> operation;
> Sure, though the internal implementation of the sys call has to
> explicitly make that happen

Well, sure.

> - and to be useful, the sys call interface really has to be designed
> to support it.   This one isn't (and as specified, cannot be.)

Huh?  I don't see it that way.

> That is, if you imagine the [signal] in question being SIGKILL (as in
> the scenario I postulated initially) there's no problem - if the code
> for the sys call wants to allow that to work before it completes, it
> easily can.

> But instead imagine it is SIGALRM - now there's a problem, since
> there's no way in the interface to report how much of the work was
> done, the only thing that can be done if EINTR happens, is repeat the
> sys call.

Right.  That's a reason to put it in the kernel, so it doesn't need to
redo past work.  (This is not a complete fix, because it applies only
to calls which don't increase st_size.)

I note that the documentation webpage lists EINTR.
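For concreteness, a caller that wants to cope with EINTR is reduced to
something like this (a sketch; the wrapper name is mine):

    #include <errno.h>
    #include <fcntl.h>

    /*
     * The only recovery the interface allows: repeat the call over
     * the whole range, since posix_fallocate() cannot report partial
     * progress.  A kernel implementation can skip blocks it already
     * allocated on an earlier attempt; a userland emulation re-walks
     * everything from scratch.
     */
    static int
    fallocate_retry(int fd, off_t off, off_t len)
    {
        int error;

        do {
            error = posix_fallocate(fd, off, len);
        } while (error == EINTR);
        return error;
    }

(If a periodic signal arrives during every attempt, that loop never
terminates - the scenario raised just below.)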

> For a periodic [signal] like SIGALRM (or any of the other timer
> signals) chances are that a signal will arrive during the sys call
> every time, resulting in an infinite loop of
> fallocate/signal/fallocate/signal...

Sure.  There are lots of ways programmers can write code which ends up
in livelock.  I don't see how this one deserves any more special
treatment than the others.

> [...] - cases where actual holes in the middle of a file are being
> filled in I see as so unlikely in practice that they're totally
> irrelevant (like unless the app has just made the file by seeking
> forward and writing a byte, how does it ever know whether or not
> there are holes to fill?  And why would it do it that way, and follow
> by fallocate()?  If fallocate() exists, surely it would just use that
> to make the file?)

I see it as being intended for programs doing things like databases:
they may want to allocate disk space at the point where they know
they'll need it but it's still easy to back out of the operation if
the allocation fails.  Once the space is allocated, they can carry on
knowing they won't run into a full disk partway through, later, when
it's much harder to deal with.

In this paradigm, the application doesn't know whether there used to be
a hole there and doesn't care; the important thing is that, after the
call, there isn't.

I'm not sure why it's better than (read-and-)write for such an
application, but that's the use case it feels designed for, to me.
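As a sketch of that database-style pattern (the file name and size
here are invented for illustration):

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int
    main(void)
    {
        int fd, error;

        fd = open("journal.db", O_RDWR | O_CREAT, 0600);
        if (fd < 0) {
            perror("open");
            return EXIT_FAILURE;
        }

        /*
         * Reserve 64 MB up front.  A full disk shows up here, where
         * backing out is trivial, instead of partway through an
         * update later on.
         */
        error = posix_fallocate(fd, 0, (off_t)64 * 1024 * 1024);
        if (error != 0) {
            fprintf(stderr, "posix_fallocate: %s\n", strerror(error));
            unlink("journal.db");   /* nothing depends on the file yet */
            close(fd);
            return EXIT_FAILURE;
        }

        /* ...carry on; writes in the reserved range won't hit ENOSPC... */
        close(fd);
        return EXIT_SUCCESS;
    }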

(Also, seeking and writing is not the only way to create a large file;
truncate/ftruncate can extend files on at least some systems.)
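To make the contrast explicit (a sketch, not tied to any particular
system's behaviour): ftruncate() grows st_size, but on filesystems
with sparse-file support it allocates no data blocks, so later writes
into the new range can still fail with ENOSPC - the opposite of the
guarantee posix_fallocate() is meant to give.

    #include <unistd.h>

    static int
    grow_sparse(int fd, off_t newsize)
    {
        /* Extends st_size; the new data blocks stay unallocated (a hole). */
        return ftruncate(fd, newsize);
    }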

> ps: Note that I see that the linux way of handling fallocate isn't to
> write blocks of 0's, but to allocate uninitialised blocks, and mark
> them uninitialised [...]  That way they make fallocate() really fast
> (just assigns block numbers to the file) but it requires that the
> un-init flag (wherever, and however, they keep that) is 100%
> reliable.  Nothing is really that reliable...

Not 100% reliable, but at least as reliable as the rest of the
filesystem.  FFS assumes that di_db[] in inodes, and the block
allocation bitmaps, won't change behind its back, too; I don't see why
this would be any different, really.

I'm not sure whether I'd be willing to pay one more bit per frag in
order to (greatly) speed up reads of allocated but unwritten blocks; my
own guess would be that such things are rare enough that optimizing
them doesn't really matter - though, of course, I don't often find uses
for things I don't have.  Perhaps Linux has found a real use for such
things.  The major uses I can think of for them are things like NFS
swapfiles, where you want to allocate the whole file but have no need
to write it.  My own livebackup is in a similar situation; it could
benefit from an "allocated but unwritten" state for file data blocks.
Neither of those is common enough to seem worth the price to me (in
extra record-keeping data space, and the code to do it).

/~\ The ASCII                             Mouse
\ / Ribbon Campaign
 X  Against HTML                mouse%rodents-montreal.org@localhost
/ \ Email!           7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B

