Subject: Re: Raising NFS parameters for higher bandwith or "long fat pipe"
To: Manuel Bouyer <bouyer@antioche.lip6.fr>
From: Jonathan Stone <jonathan@DSG.Stanford.EDU>
List: tech-net
Date: 12/13/2003 14:58:58
Manuel confirms that NetBSD (-current and 1.6.2_RC2) has asymmmetric
NFS client performance: with a NetBSD NFS server, Solaris NFS clients
sustain 100Mbyte/sec.

For a NetBSD-current client, NFS reads throughput is about half of
Solaris, and with writes throughput about 80% of solaris. (.1.6.2_RC2
shows a similar spread, but is consistently slower than -current).

I found a FreeBSD 4.7 NFS client is somewhat better, and has roughly
symmetric read/write performance. Some digging in the code suggests
FreeBSD does better on reads becasue it issues explicit NFS read-ahead,
in addition to the `read one ahead' we do (which FreeBSD does too).

Below is an untested, almost certinaly broken, patch showing how
FreeBSD 4.x does NFS readahead. It' basically two chunks:

  1. Adding a loop to sys/nfs/nfs_bio.c:nfs_bioread() to compute
	offsets and issue additional read-aheads.

  2. Changing the api to nfs_asyncio() so that the extra read-aheads
     can pass appropriate credentials (and struct proc *?) to
     nfs_async().

I dont know FreeBSD kernel internals well enough to rework these to
work correctly in NetBSD right off the bat. I'd appreciate some help,
as this should make a big difference to NFS client read performance.

* There must be a better way to compute the read-ahead block number!

* Should the vfs_busy_pages/vfs_unbusy_pages map directly to
  uvm_page_busy/uvm_page_unbusy()?  Or can we just ignore the FreeBSD
  vfs_{,un}busy_page() since our nfs client code doesn't do anything
  similar in in its other path?

* How about the change to pass a struct proc * and creds to nfs_async()?
  I'm  actively looking at NFS with RPSEC_GSS righ tnow, so  passing
  creds looks right to me.  But does it break anything?

* What's up with  the cr_hold() with no apparent matching decrement of
  the credential refcount?

NB: if & when if I commit this I will credit FreeBSD. Do the FreeBSD
people here know of anyone specific that should get credit for it?
(Matt Dillon, maybe?)


Index: nfs_bio.c
===================================================================
RCS file: /cvsroot/src/sys/nfs/nfs_bio.c,v
retrieving revision 1.113
diff -u -r1.113 nfs_bio.c
--- nfs_bio.c	2003/11/17 02:02:31	1.113
+++ nfs_bio.c	2003/12/13 22:06:23
@@ -94,6 +94,7 @@
 	int enough = 0;
 	struct dirent *dp, *pdp;
 	off_t curoff = 0;
+	int nra;
 
 #ifdef DIAGNOSTIC
 	if (uio->uio_rw != UIO_READ)
@@ -226,9 +227,45 @@
 	    }
 	    baddr = (caddr_t)0;
 	    switch (vp->v_type) {
-	    case VREG:
+	    case VREG: {
+		off_t lbn, rabn;
+		u_int biosize, seqcount;
+		
 		nfsstats.biocache_reads++;
 
+		/* XXX jrs, 2003-12-08: issue more readaheads here */
+		lbn = uio->uio_offset / biosize;
+		on = uio->uio_offset & (biosize - 1);
+		biosize = vp->v_mount->mnt_stat.f_iosize;
+		seqcount = (u_int)((off_t)(ioflag >> 16) * biosize / 16384);
+		/*
+		 * Start the read ahead(s), as required.
+		 */
+		if (nfs_numasync > 0 && nmp->nm_readahead > 0) {
+		    for (nra = 0; nra < nmp->nm_readahead && nra < seqcount &&
+			(off_t)(lbn + 1 + nra) * biosize < np->n_size;
+			nra++) {
+			rabn = lbn + 1 + nra;
+			if (!incore(vp, rabn)) {
+			    rabp = nfs_getcacheblk(vp, rabn, biosize, p);
+			    if (!rabp)
+				return (EINTR);
+			    if ((rabp->b_flags & (B_CACHE|B_DELWRI)) == 0) {
+				rabp->b_flags |= (B_READ | B_ASYNC);
+				/* vfs_busy_pages(rabp, 0); */
+				if (nfs_asyncio(rabp, p->p_ucred, p)) {
+				    rabp->b_flags |= B_INVAL|B_ERROR;
+				    /* vfs_unbusy_pages(rabp); */
+				    brelse(rabp);
+				    break;
+				}
+			    } else {
+				brelse(rabp);
+			    }
+			}
+		    }
+		}
+
 		error = 0;
 		if (uio->uio_offset >= np->n_size) {
 			break;
@@ -250,6 +287,7 @@
 		}
 		n = 0;
 		break;
+	    }
 
 	    case VLNK:
 		nfsstats.biocache_readlinks++;
@@ -454,7 +492,7 @@
 			    if ((rabp->b_flags & (B_DONE | B_DELWRI)) == 0) {
 				rabp->b_dcookie = nndp->dc_cookie;
 				rabp->b_flags |= (B_READ | B_ASYNC);
-				if (nfs_asyncio(rabp)) {
+				if (nfs_asyncio(rabp, curproc->p_ucred, curproc)) {
 				    rabp->b_flags |= B_INVAL;
 				    brelse(rabp);
 				}
@@ -780,22 +818,41 @@
  */
 
 int
-nfs_asyncio(bp)
+nfs_asyncio(bp, cred, proc)
 	struct buf *bp;
+	struct ucred *cred;
+	struct proc *proc;
 {
 	int i;
 	struct nfsmount *nmp;
 	int gotiod, slpflag = 0, slptimeo = 0, error;
 
+	/*
+	 * If no async daemons then return EIO to force caller to run the rpc
+	 * synchronously.
+	 */
 	if (nfs_numasync == 0)
 		return (EIO);
 
 	nmp = VFSTONFS(bp->b_vp->v_mount);
+
+#ifdef notyet
+	/*
+	 * Commits are usually short and sweet so lets save some cpu and 
+	 * leave the async daemons for more important rpc's (such as reads
+	 * and writes).
+	 */
+	if ((bp->b_flags & (B_READ|B_NEEDCOMMIT)) == B_NEEDCOMMIT &&
+	    (nmp->nm_bufqiods > nfs_numasync / 2)) {
+		return(EIO);
+	}
+#endif
+
 again:
 	if (nmp->nm_flag & NFSMNT_INT)
 		slpflag = PCATCH;
 	gotiod = FALSE;
- 
+
 	/*
 	 * Find a free iod to process this request.
 	 */