tech-kern archive

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]

Re: very bad behavior on overquota writes

On Thu, Nov 22, 2012 at 12:46:54PM +0100, Manuel Bouyer wrote:
> On Wed, Nov 21, 2012 at 03:02:58PM +0100, Manuel Bouyer wrote:
> > Hello,
> > I've been looking at performance issues on our NFS server, which I tracked
> > down to overquota writes. The problem is caused by software that do
> > writes without error checkings. When doing this, the nfsd threads becomes
> > 100% busy, and nfs requests from other clients can de delayed by
> > several seconds.
> > To reproduce this, I've used the attached program. Basically it does an
> > endless write, without error checking. I first ran it on a NFS client 
> > against
> > a test nfs server and could reproduce the problem. The I ran it
> > directly on the server against the ffs-exported filesystem, and
> > could see a similar behavior:
> > When the uid running it is overquota, the process start using 100% CPU in
> > system and the number of write syscall per second drops dramatically (from
> > about 170 to about 20). I can see there is still some write activity on the
> > disk (about 55 KB/s, 76 writes/s).
> > 
> > The problem is that when we notice we can't do the write, ffs_write() 
> > already
> > did some things that needs to be undone. one of them, which is time 
> > consuming,
> > is to trucate the file back to its original size. Most of the time is
> > spent in genfs_do_putpages().
> > 
> > The problem here seems to be that we always to a page list walk because
> > endoff is 0. If the file is large enough to have lots of pages in core,
> > a lot of time is spent here.
> > 
> > The attached patch improves this a bit, by not always using a list walk.
> > but I wonder if this could cause some pages to be lost until the vnode
> > is recycled. AFAIK v_writesize nor v_size can be shrunk without
> > genfs_do_putpages() being called, but I may have missed something.
> Here's another patch, after comments from chs@ in a private mail.
> There really are 2 parts here:
> - avoid unecessery list walk at truncate time. This is the change to
>   uvm_vnp_setsize() (truncate between pgend and oldsize instead of 0, which
>   always triggers a list walk - we know there is nothing beyond oldsize)
>   and ffs_truncate() (vtruncbuf() is needed only for non-VREG files,
>   and will always trigger a list walk).
> This help for the local write case, where on my test system syscalls/s rise
> from 170 (when write succeed) to 570 (when EDQUOT is returned). Without
> this patch, the syscalls/s would fall to less than 20.
> But this doesn't help much for nfsd, which can get write requests way beyond
> the real end of file (the client's idea of end of file is wrong).
> In such a case, the distance between pgend and oldsize in uvm_vnp_setsize()
> is large enough to trigger a list walk, even through there is really only
> a few pages to free. I guess fixing this would require an interface change
> to pass to uvm_vnp_setsize() for which the allocation did really take place.
> But I found a workaround: when a write error occurs, free all the
> pages associated with the vnode in ffs_write(). This way, on subsequent writes
> only new pages should be in the list, and freeing them is fast.
> With this change, the nfsd performances don't change when the quota limit
> is hit.
> Does anyone see a problem with this ?

don't you need a "mutex_enter(vp->v_interlock)" before calling VOP_PUTPAGES()?
(did you test it with LOCKDEBUG?)  otherwise it looks ok to me.

in the places were we skip the list walk because there aren't supposed to be
any pages with offsets larger than some threshold, it would be nice if we could
have a DEBUG check that would verify there actually aren't any pages with
offsets that are larger than that.

I should also mention that this kind of workaround shouldn't be necessary
anymore once the yamt-pagecache branch is merged... there, the uvm_object list
of pages is replaced by a tree so that finding a few pages is fast no matter
how many other pages of that object are present.  I don't know when that
will be ready though.


Home | Main Index | Thread Index | Old Index