tech-kern archive


Re: very bad behavior on overquota writes

On Wed, Nov 21, 2012 at 03:02:58PM +0100, Manuel Bouyer wrote:
> Hello,
> I've been looking at performance issues on our NFS server, which I tracked
> down to overquota writes. The problem is caused by software that does
> writes without error checking. When this happens, the nfsd threads become
> 100% busy, and nfs requests from other clients can be delayed by
> several seconds.
> To reproduce this, I've used the attached program. Basically it does an
> endless write, without error checking. I first ran it on an NFS client against
> a test nfs server and could reproduce the problem. Then I ran it
> directly on the server against the ffs-exported filesystem, and
> could see a similar behavior:
> When the uid running it is overquota, the process starts using 100% CPU in
> system and the number of write syscalls per second drops dramatically (from
> about 170 to about 20). I can see there is still some write activity on the
> disk (about 55 KB/s, 76 writes/s).
> The problem is that when we notice we can't do the write, ffs_write() has
> already done some things that need to be undone. One of them, which is time
> consuming, is truncating the file back to its original size. Most of the
> time is spent in genfs_do_putpages().
> The problem here seems to be that we always do a page list walk because
> endoff is 0. If the file is large enough to have lots of pages in core,
> a lot of time is spent here.
> The attached patch improves this a bit, by not always using a list walk.
> but I wonder if this could cause some pages to be lost until the vnode
> is recycled. AFAIK neither v_writesize nor v_size can be shrunk without
> genfs_do_putpages() being called, but I may have missed something.
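For reference, the reproducer attached to the original mail is essentially the loop below. This is a hypothetical sketch, not the original source: the 8 KB buffer size and the function name are my assumptions; only the behavior (endless writes with errors ignored) comes from the description above.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical sketch of the attached reproducer (not the original
 * source): hammer a file with fixed-size writes and ignore every
 * error, the way careless software does.  niter == 0 means write
 * forever, which is how the overquota behavior was triggered.
 */
void
hammer_writes(int fd, long niter)
{
	char buf[8192];		/* buffer size is an assumption */
	long i;

	memset(buf, 'x', sizeof(buf));
	for (i = 0; niter == 0 || i < niter; i++)
		(void)write(fd, buf, sizeof(buf));	/* no error checking */
}
```

Run as an overquota uid against the exported filesystem: each failing write(2) then goes through ffs_write()'s error path and, without the patches below, a full page-list walk in genfs_do_putpages().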

Here's another patch, after comments from chs@ in a private mail.

There really are two parts here:
- avoid an unnecessary list walk at truncate time. This is the change to
  uvm_vnp_setsize() (truncate between pgend and oldsize instead of 0, which
  always triggers a list walk - we know there is nothing beyond oldsize)
  and ffs_truncate() (vtruncbuf() is needed only for non-VREG files,
  and will always trigger a list walk).

This helps for the local write case, where on my test system syscalls/s rises
from 170 (when writes succeed) to 570 (when EDQUOT is returned). Without
this patch, syscalls/s would fall to less than 20.

But this doesn't help much for nfsd, which can get write requests way beyond
the real end of file (the client's idea of end of file is wrong).
In such a case, the distance between pgend and oldsize in uvm_vnp_setsize()
is large enough to trigger a list walk, even though there are really only
a few pages to free. I guess fixing this would require an interface change,
to pass uvm_vnp_setsize() the range for which the allocation really took place.
But I found a workaround: when a write error occurs, free all the
pages associated with the vnode in ffs_write(). This way, on subsequent writes
only new pages should be in the list, and freeing them is fast.

With this change, nfsd performance no longer degrades when the quota limit
is hit.

Does anyone see a problem with this?

Manuel Bouyer <>
     NetBSD: 26 years of experience will always make the difference
Index: uvm/uvm_vnode.c
RCS file: /cvsroot/src/sys/uvm/uvm_vnode.c,v
retrieving revision
diff -u -p -u -r1.97.8.1 uvm_vnode.c
--- uvm/uvm_vnode.c     11 Jun 2012 21:25:02 -0000
+++ uvm/uvm_vnode.c     22 Nov 2012 11:31:25 -0000
@@ -351,7 +351,7 @@ uvm_vnp_setsize(struct vnode *vp, voff_t
        KASSERT(oldsize != VSIZENOTSET || pgend > oldsize);
        if (oldsize > pgend) {
-               (void) uvn_put(uobj, pgend, 0, PGO_FREE | PGO_SYNCIO);
+               (void) uvn_put(uobj, pgend, oldsize, PGO_FREE | PGO_SYNCIO);
        vp->v_size = vp->v_writesize = newsize;
Index: ufs/ffs/ffs_inode.c
RCS file: /cvsroot/src/sys/ufs/ffs/ffs_inode.c,v
retrieving revision 1.109
diff -u -p -u -r1.109 ffs_inode.c
--- ufs/ffs/ffs_inode.c 27 Jan 2012 19:22:49 -0000      1.109
+++ ufs/ffs/ffs_inode.c 22 Nov 2012 11:31:25 -0000
@@ -400,9 +400,17 @@ ffs_truncate(struct vnode *ovp, off_t le
        oip->i_size = osize;
        DIP_ASSIGN(oip, size, osize);
-       error = vtruncbuf(ovp, lastblock + 1, 0, 0);
-       if (error && !allerror)
-               allerror = error;
+       /*
+        * uvm_vnp_setsize() already did the VOP_PUTPAGES(), and this is
+        * all that's needed for regular files (nothing should be in
+        * the bufcache). So avoid the vtruncbuf() in this case, it would
+        * do an extra, possibly costly VOP_PUTPAGES()
+        */
+       if (ovp->v_type != VREG) {
+               error = vtruncbuf(ovp, lastblock + 1, 0, 0);
+               if (error && !allerror)
+                       allerror = error;
+       }
         * Indirect blocks first.
Index: ufs/ufs/ufs_readwrite.c
RCS file: /cvsroot/src/sys/ufs/ufs/ufs_readwrite.c,v
retrieving revision
diff -u -p -u -r1.101.2.1 ufs_readwrite.c
--- ufs/ufs/ufs_readwrite.c     7 May 2012 03:01:14 -0000
+++ ufs/ufs/ufs_readwrite.c     22 Nov 2012 11:31:25 -0000
@@ -521,6 +527,16 @@ out:
                (void) UFS_TRUNCATE(vp, osize, ioflag & IO_SYNC, ap->a_cred);
                uio->uio_offset -= resid - uio->uio_resid;
                uio->uio_resid = resid;
+               if (error == EDQUOT || error == ENOSPC) {
+                       /* If the process keeps writing (e.g. nfsd),
+                        * UFS_TRUNCATE() may be very expensive as it
+                        * walks the page list. As a workaround, flush and
+                        * free all pages associated with this vnode.
+                        */
+                       (void)VOP_PUTPAGES(vp, 0, 0,
+                           PGO_ALLPAGES | PGO_CLEANIT | PGO_FREE | PGO_SYNCIO |
+                           PGO_JOURNALLOCKED);
+               }
        } else if (resid > uio->uio_resid && (ioflag & IO_SYNC) == IO_SYNC)
                error = UFS_UPDATE(vp, NULL, NULL, UPDATE_WAIT);
