tech-kern archive

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]

Re: CVS commit: src/bin/cp



(adding tech-kern because this seems likely to become lengthy; if
following up please drop source-changes-d)

On Sun, Oct 24, 2010 at 11:30:43AM +0100, David Laight wrote:
 > > > [mmap mode disabled in cp due to long vnode lock waits]
 > >
 > > Because individual write() calls are supposed to be atomic, I don't
 > > think there is such a thing as a locking improvement that'll help with
 > > this behavior. :-/
 > 
 > I think write() only needs to lock the the file enough to ensure that
 > the file offset is correct.

No, since in general the file is also being extended (certainly in
this case it is) it also has to lock the file size, and that's going
to deny stat() until it's done.

 > Possibly the written range needs locking against other accesses -
 > but I think the app is supposed to use file locking for that

No, the reason single syscalls are atomic is specifically so apps
don't have to use file locking for simple cases, like e.g. writing
lines to log files.

 >  (and mmap will always be non-atomic w.r.t. write).

Yes, but the mmap in this case is for reading.

(what the code in cp does is map the source and then write to the
destination file from the mapped region)

 > Actually if 2 writes are issued for the same part of a file, the
 > kernel can act as if they were requested in either order - since the
 > app(s) cannot know the order the calls would be made in!

As long as nothing else happens in between that establishes an
ordering, yes.

 > Which means it could just sleep the 2nd write until the first terminates!

Yes indeed, that's what locks do :-)

 > Writes with O_APPEND (and writes that extend the file) are more
 > problematical since you cant allow a second such write to start until
 > the first has completed - for instance it might try to read from
 > an unmapped user-space address and return a short length.

Right.

 > > Except I guess going to some kind of multiversion model for vnodes.
 > 
 > Don't you just need 2 locks? One for locking the data areas, and the
 > other for the file data itself.

Nope. Or at least, that doesn't fix the objectionable behavior.
Because you need to lock the file size, if you want stat() to be able
to complete while the operation is in progress you need to be able to
let stat() read a version of the file size from before the operation
began.

One can do multiversion schemes in the kernel; apparently it's neither
as ghastly or as expensive as one would think. See for example the
transactional Linux paper from SOSP 2009. But it'd be a pretty big
hacking job to do it in NetBSD.

Anyway, ISTM that writing from the mmap buffer in say 64K chunks would
retain nearly all the advantages and get rid of the latency problem.

-- 
David A. Holland
dholland%netbsd.org@localhost


Home | Main Index | Thread Index | Old Index