tech-kern archive


Re: posix_fallocate

>>> [posix_fallocate]
>> We could fork a kernel thread that would go to userspace to do the
>> work with a write() loop, with appropriate credentials.  Does it
>> make sense?
> It would need to be a read/write loop; nothing says that there cannot
> already be blocks allocated in the space being fallocated, and their
> content should not change.

That's a reason to put it in the kernel, actually.  The kernel can tell
which ranges of a file have already had space allocated for them, so it
can be just writes.  (Well, for local filesystems.  Throw NFS or its
ilk into the mix and it gets more interesting.)

> But if implemented that way, why bother at all?  Why not just put the
> code in a user space libc posix_fallocate() function, and be done
> with it, it should not require any kernel support at all.

Well, I don't know what the point of having posix_fallocate at all
would be.  But the obvious answer to this is atomicity, especially in
the presence of other writers: the kernel is capable of making sure it
doesn't destroy someone else's write by mistake, which userland isn't
(in the userland implementation, there's a window between read and
write when someone else can write only to get overwritten).  If that
matters for the target application, that's a reason to prefer an
in-kernel implementation.
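
To make the window concrete, here is a minimal sketch of what such a
userland loop might look like.  The names user_fallocate, touch_byte
and demo_user_fallocate are hypothetical, and the one-byte-per-block
approach assumes that writing a single byte is enough to get a block
allocated; this is an illustration of the race, not a proposed libc
implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>

/* Rewrite the byte at pos with its current value (zero past EOF). */
static int
touch_byte(int fd, off_t pos)
{
	char c = 0;
	ssize_t n = pread(fd, &c, 1, pos);	/* n == 0: past EOF, c stays 0 */

	if (n < 0)
		return errno;
	/*
	 * Race window: another writer can store a new byte at pos here,
	 * and the pwrite() below will silently put the old one back.
	 */
	n = pwrite(fd, &c, 1, pos);
	if (n != 1)
		return n < 0 ? errno : EIO;
	return 0;
}

/*
 * Hypothetical userland stand-in for posix_fallocate(): touch one byte
 * in every blksz-sized step of [off, off + len), then the last byte so
 * that st_size ends up at least off + len.  Returns 0 or an errno value.
 */
int
user_fallocate(int fd, off_t off, off_t len, off_t blksz)
{
	off_t end = off + len;
	int error;

	assert(off >= 0 && len > 0 && blksz > 0);
	for (off_t pos = off; pos < end; pos += blksz)
		if ((error = touch_byte(fd, pos)) != 0)
			return error;
	return touch_byte(fd, end - 1);
}

/* Usage sketch: existing data must survive and the file must grow. */
int
demo_user_fallocate(void)
{
	char path[] = "/tmp/ufalloc.XXXXXX", buf[5];
	struct stat st;
	int ok, fd = mkstemp(path);

	if (fd < 0)
		return -1;
	unlink(path);
	ok = write(fd, "hello", 5) == 5 &&
	    user_fallocate(fd, 0, 10000, 512) == 0 &&
	    fstat(fd, &st) == 0 && st.st_size == 10000 &&
	    pread(fd, buf, 5, 0) == 5 && memcmp(buf, "hello", 5) == 0;
	close(fd);
	return ok ? 0 : -1;
}
```

Nothing in userland can close the gap between the pread() and the
pwrite() without cooperation from every other writer, which is the
atomicity point above.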

> On the other hand, posix_fallocate() could allocate petabytes in a
> single invocation of the sys call, assuming that the filesystem had
> that much space available.   I haven't looked recently, but last time
> I did, preemptible sys calls still didn't mean that userland signals
> would be delivered in the middle of the operation of a single sys
> call, [...]

No, it doesn't mean that, strictly.  But any syscall that chooses to
can be interruptible by signals during any sleep involved in its
operation; this was true even before multiprocessor support.  There
will be a loop involved _somewhere_ in any posix_fallocate()
implementation, and I can't imagine that an implementation wouldn't
sleep somewhere waiting for the underlying filesystem operations.
Those sleeps can be made interruptible by signals; at most it will
complicate the exit path.
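
As a rough userland analogy (not kernel code), the shape of such a
loop might be sketched as follows.  Everything here is hypothetical:
fallocate_loop, the alloc_chunk callback standing in for the
per-range filesystem work, and the flag-checking that stands in for
an interruptible sleep returning early.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>

static volatile sig_atomic_t got_signal;

static void
on_signal(int sig)
{
	(void)sig;
	got_signal = 1;
}

/*
 * Hypothetical chunked allocation loop.  alloc_chunk stands in for
 * whatever per-range work the filesystem does (and may sleep in).
 * On return, *done holds the number of bytes handled so far.
 */
int
fallocate_loop(off_t len, off_t chunk,
    int (*alloc_chunk)(off_t pos, off_t n), off_t *done)
{
	int error;

	assert(len > 0 && chunk > 0);
	*done = 0;
	while (*done < len) {
		off_t n = len - *done < chunk ? len - *done : chunk;

		if (got_signal)
			return EINTR;	/* the slightly more complicated exit path */
		if ((error = alloc_chunk(*done, n)) != 0)
			return error;
		*done += n;
	}
	return 0;
}

/* Demo callback: pretend a signal arrives during the third chunk. */
static int calls;

static int
fake_chunk(off_t pos, off_t n)
{
	(void)pos;
	(void)n;
	if (++calls == 3)
		raise(SIGUSR1);
	return 0;
}

/* Returns nonzero if the loop stopped after exactly three chunks. */
int
demo_interrupt(void)
{
	off_t done;
	int error;

	signal(SIGUSR1, on_signal);
	error = fallocate_loop(10 * 4096, 4096, fake_chunk, &done);
	return error == EINTR && done == 3 * 4096 && calls == 3;
}
```

The only extra work on the exit path in this sketch is reporting how
far the loop got; a real implementation would also have to decide
what to do with space already allocated.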

> nor does anything suggest that signals are supposed to interrupt the
> operation of posix_fallocate() half way through - which suggests to
> me, that as designed, it should continue until it is finished once
> invoked, whatever anyone tries to do to the process that invoked it.

Actually, I see nothing in the description that prevents that.  Given
what it does, a half-completed posix_fallocate is indistinguishable, to
userland, from a never-started posix_fallocate, provided the former
hasn't got as far as affecting st_size.  It would be interesting to
deal with a posix_fallocate that raises st_size being interrupted after
it's written some but not all of the new space, especially in the
presence of another writer writing into the same area of the file, but
I feel certain those problems are solvable, even if it means pushing a
small fraction of the implementation down into the filesystem - and
maybe not even that; the existence of kqueue's EVFILT_VNODE NOTE_WRITE
means that most of the necessary machinery is already in place.

This is not to say that I support the idea of adding posix_fallocate;
I'm not sure what I think on that question.  But some of the arguments
kre has presented here do not, IMO, hold water.

/~\ The ASCII                             Mouse
\ / Ribbon Campaign
 X  Against HTML      
/ \ Email!           7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B
