Subject: Re: suggested small bufcache and vfs tuning changes
To: Daniel Carosone <dan@geek.com.au>
From: Thor Lancelot Simon <tls@rek.tjls.com>
List: tech-kern
Date: 03/22/2004 10:00:18
--Qxx1br4bt0+wmkIi
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

On Fri, Mar 19, 2004 at 11:37:19AM +1100, Daniel Carosone wrote:
> The attached are a couple of tweaks, for wider consideration:
> 
>  - to move buf_drain back to the end of the pagedaemon, and
>    corresponding changes to the sizing calcs in the bufcache code
>    accordingly. This provides better balance, and helps stop crashing
>    the metadata cache down in size too fast.
> 
>  - to the vfs code to make dirdelay a small rotor, which seems to
>    greatly alleviate the problem of softdep creating a huge storm of
>    directory updates all at once, with corresponding bad downstream
>    effects.

Having looked at (and instrumented) the I/O spreading rotor as it is in
your patch, I am not so sure it really does anything unless you happen
to have disks that can clear well upwards of 8,192 metadata operations
in a single second.  I implemented a very simple form of automatic
tuning for the spreading, and it _seems_ to work as intended.  I'm
attaching a patch.
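
In outline, the auto-tuning works like this (condensed from the attached
spread.diff; this is a paraphrase of the patch, not the literal hunks):

	/*
	 * vn_syncer_add_to_worklist(): spread newly-dirtied directories
	 * across 16 syncer slots, advancing one slot for every "dirburst"
	 * insertions, instead of piling them all onto the dirdelay slot.
	 */
	case VDIR:
		delay = dirdelay + (dlinc++ / dirburst) % 16;
		break;

	/*
	 * sched_sync(): each pass counts the VDIR vnodes it flushed
	 * (dirsdone).  If the pass finished within a second, grow dirburst
	 * halfway toward dirsdone, capped at SYNCER_MAXBURST; if the pass
	 * overran, shrink dirburst by the number of seconds it took, with
	 * a floor of SYNCER_MINBURST.
	 */
	if (mono_time.tv_sec == starttime) {
		if (dirsdone > dirburst)
			dirburst += (dirsdone - dirburst) / 2;
		if (dirburst > SYNCER_MAXBURST)
			dirburst = SYNCER_MAXBURST;
	} else {
		dirburst = dirburst / (mono_time.tv_sec - starttime);
		if (dirburst < SYNCER_MINBURST)
			dirburst = SYNCER_MINBURST;
	}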

However, there are some significant caveats:

1) I am not _certain_ that all directory I/O appears in the syncer
   worklists as a vnode of type VDIR for each directory.  I have
   read and reread the code, but I would like more people to look at
   this.

2) If you don't have some kind of working read priority, this change
   can greatly increase the average latency for read transactions when
   the system is under heavy I/O load -- instead of a few seconds during
   which almost no read operations complete, perceived by the user as an
   annoying "pause" of the system, you get many more seconds during which
   only a _few_ read transactions complete, which will bother many users a
   lot more.

As it turns out, some disk drivers do not respect the default buffer-queue
strategy setting (including some that *very* questionably use FCFS sort,
perhaps on the assumption that underlying hardware will re-sort).  One of
the most egregious offenders here is 'ld'; I happen to have two ld devices
on my test machine, one that has a large NV cache and re-sorts very
intelligently (including implementing read priority and read/write balancing)
and another that does almost no sorting at all.  The latter worked *terribly*
with this change until I fixed the ld driver to respect the default
buffer queue strategy setting; then it worked well with NEW_BUFQ_STRATEGY.

I think that we need to separate the "balance reads and writes/pass reads
around writes" setting from the "sort into block order" setting, and do
the read/write balancing or read priority *even in the FCFS case*.  It's ugly, but
we _could_ quickly implement this by doing both CSCAN and FCFS versions
of the two new strategies, and then choosing "which prioritization strategy"
from the user's kernel configuration, while letting the driver itself
choose CSCAN or FCFS by setting a flag, as is currently done for raw block
number or cylinder number.
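
To make that concrete: under such a split, a driver whose controller
already re-sorts could ask for FCFS ordering while still getting
whichever prioritization policy the kernel configuration selected.
(BUFQ_SORT_FCFS below is a name I am making up for the example; it
does not exist today.)

	/* smart controller: configured prioritization policy, but hand
	 * requests down in arrival order (hypothetical flag) */
	bufq_alloc(&sc->sc_bufq, BUFQ_DISK_DEFAULT_STRAT() | BUFQ_SORT_FCFS);

	/* dumb controller: same policy, CSCAN on raw block numbers, as
	 * ld does after the attached fix */
	bufq_alloc(&sc->sc_bufq, BUFQ_DISK_DEFAULT_STRAT() | BUFQ_SORT_RAWBLOCK);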

I think, at least, we need an FCFS version of "NEW_BUFQ_STRATEGY"
(pure read priority) and the ability for drivers currently using FCFS
sorting to specify that _as a flag_.  The corresponding change to the
PRIO_CSCAN case would be to implement the same three-queue treatment
but with FCFS for each queue; again, all drivers would get whichever
policy was made default by kernel configuration, but get FCFS treatment
per-queue if they flagged it that way.
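
Inside the strategy code itself the per-queue difference is tiny; roughly
this (a sketch only -- the queue structure and the names here are
placeholders, not the real priocscan internals):

	/* same three priority queues as PRIO_CSCAN; only the ordering
	 * within each queue changes */
	if (flagged_fcfs)
		TAILQ_INSERT_TAIL(&q->q_queue, bp, b_actq);	/* arrival order */
	else
		cscan_insert(q, bp);	/* ascending block order from the head */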

In the long term, we need to think about adjusting ld.c so that the
driver-specific glue code, which knows the hardware much better, can select FCFS
(for something really smart like my Mylex adapter) or CSCAN (for something
not as smart, like my 3ware one).
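
Something along these lines is what I have in mind; sc_bufq_pref is a
hypothetical new ld_softc member and none of this exists yet -- the point
is only that the attachment glue gets to state a preference:

	/* in the Mylex attachment glue, before calling ldattach():
	 * big NV cache, controller re-sorts, so plain FCFS is fine */
	ld->sc_bufq_pref = BUFQ_FCFS;

	/* in ldattach(), replacing the hard-wired bufq_alloc() call:
	 * honour the glue's preference, otherwise fall back to the
	 * kernel-config default with raw-block sorting */
	bufq_alloc(&sc->sc_bufq, sc->sc_bufq_pref != 0 ?
	    sc->sc_bufq_pref :
	    BUFQ_DISK_DEFAULT_STRAT() | BUFQ_SORT_RAWBLOCK);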

Comments?  I really think we need to straighten this out for 2.0; luckily
most of the changes are quite small and I have benchmark data that can
easily establish whether they are beneficial or not.

I'm attaching my versions of the buf_drain move and auto-tuning directory
I/O spreading changes, plus the small fix to ld.c to use the default
policy.

Thor

--Qxx1br4bt0+wmkIi
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename="spread.diff"

? kern/agefirst.diff
? kern/vfs_bio.c.n
Index: dev/ld.c
===================================================================
RCS file: /cvsroot/src/sys/dev/ld.c,v
retrieving revision 1.26
diff -u -r1.26 ld.c
--- dev/ld.c	29 Jun 2003 22:29:59 -0000	1.26
+++ dev/ld.c	22 Mar 2004 14:57:52 -0000
@@ -151,7 +151,7 @@
 	/* Set the `shutdownhook'. */
 	if (ld_sdh == NULL)
 		ld_sdh = shutdownhook_establish(ldshutdown, NULL);
-	bufq_alloc(&sc->sc_bufq, BUFQ_FCFS);
+	bufq_alloc(&sc->sc_bufq, BUFQ_DISK_DEFAULT_STRAT()|BUFQ_SORT_RAWBLOCK);
 }
 
 int
Index: sys/vnode.h
===================================================================
RCS file: /cvsroot/src/sys/sys/vnode.h,v
retrieving revision 1.121
diff -u -r1.121 vnode.h
--- sys/vnode.h	14 Feb 2004 00:00:57 -0000	1.121
+++ sys/vnode.h	22 Mar 2004 14:57:52 -0000
@@ -385,10 +385,11 @@
  */
 extern struct vnode	*rootvnode;	/* root (i.e. "/") vnode */
 extern int		desiredvnodes;	/* number of vnodes desired */
+extern int		dirburst;	/* dirs to put on syncer list at once */
 extern long		numvnodes;	/* current number of vnodes */
 extern time_t		syncdelay;	/* max time to delay syncing data */
 extern time_t		filedelay;	/* time to delay syncing files */
-extern time_t		dirdelay;	/* time to delay syncing directories */
+extern time_t		dirdelay;	/* min delay for syncing directories */
 extern time_t		metadelay;	/* time to delay syncing metadata */
 extern struct vattr	va_null;	/* predefined null vattr structure */
 
Index: kern/vfs_bio.c
===================================================================
RCS file: /cvsroot/src/sys/kern/vfs_bio.c,v
retrieving revision 1.118
diff -u -r1.118 vfs_bio.c
--- kern/vfs_bio.c	22 Feb 2004 01:00:41 -0000	1.118
+++ kern/vfs_bio.c	22 Mar 2004 14:57:53 -0000
@@ -425,22 +425,31 @@
 static int
 buf_canrelease(void)
 {
-	int pagedemand, ninvalid = 0;
+	int pagedemand, ninvalid = 0, gotnow, giveback;
 	struct buf *bp;
 
 	LOCK_ASSERT(simple_lock_held(&bqueue_slock));
 
-	if (bufmem < bufmem_lowater)
+	gotnow = bufmem - bufmem_lowater;
+
+	if (gotnow < 0)
 		return 0;
 
 	TAILQ_FOREACH(bp, &bufqueues[BQ_AGE], b_freelist)
 		ninvalid += bp->b_bufsize;
 
 	pagedemand = uvmexp.freetarg - uvmexp.free;
-	if (pagedemand < 0)
-		return ninvalid;
-	return MAX(ninvalid, MIN(2 * MAXBSIZE,
-	    MIN((bufmem - bufmem_lowater) / 16, pagedemand * PAGE_SIZE)));
+
+	giveback = MAX(ninvalid, 
+			MAX(2 * MAXBSIZE, 
+				MAX(gotnow / 16, 
+					pagedemand * PAGE_SIZE)));
+
+	if (giveback > gotnow) {
+		giveback = gotnow;
+	}
+
+	return giveback;
 }
 
 /*
Index: kern/vfs_subr.c
===================================================================
RCS file: /cvsroot/src/sys/kern/vfs_subr.c,v
retrieving revision 1.216
diff -u -r1.216 vfs_subr.c
--- kern/vfs_subr.c	14 Feb 2004 00:00:56 -0000	1.216
+++ kern/vfs_subr.c	22 Mar 2004 14:57:58 -0000
@@ -1006,6 +1006,8 @@
 	struct buflists *listheadp;
 	int delay;
 
+	static unsigned int dlinc = 0;
+
 	/*
 	 * Delete from old vnode list, if on one.
 	 */
@@ -1028,7 +1030,7 @@
 		if ((newvp->v_flag & VONWORKLST) == 0) {
 			switch (newvp->v_type) {
 			case VDIR:
-				delay = dirdelay;
+				delay = dirdelay + (dlinc++ / dirburst) % 16;
 				break;
 			case VBLK:
 				if (newvp->v_specmountpoint != NULL) {
Index: miscfs/syncfs/sync_subr.c
===================================================================
RCS file: /cvsroot/src/sys/miscfs/syncfs/sync_subr.c,v
retrieving revision 1.16
diff -u -r1.16 sync_subr.c
--- miscfs/syncfs/sync_subr.c	13 Feb 2004 11:36:23 -0000	1.16
+++ miscfs/syncfs/sync_subr.c	22 Mar 2004 14:57:58 -0000
@@ -54,14 +54,16 @@
 int syncer_maxdelay = SYNCER_MAXDELAY;	/* maximum delay time */
 time_t syncdelay = 30;			/* max time to delay syncing data */ 
 time_t filedelay = 30;			/* time to delay syncing files */
-time_t dirdelay  = 15;			/* time to delay syncing directories */
+time_t dirdelay  = 10;			/* min delay for syncing directories */
 time_t metadelay = 10;			/* time to delay syncing metadata */
 
+int dirburst = SYNCER_MINBURST;
+
 struct lock syncer_lock;		/* used to freeze syncer */
 
 static int rushjob;			/* number of slots to run ASAP */ 
 static int stat_rush_requests;		/* number of times I/O speeded up */
- 
+
 static int syncer_delayno = 0;
 static long syncer_last;
 static struct synclist *syncer_workitem_pending;
@@ -164,12 +166,13 @@
 	struct vnode *vp;
 	struct mount *mp;
 	long starttime;
-	int s;
+	int s, dirsdone;
 
 	updateproc = curlwp;
 	
 	for (;;) {
-		starttime = time.tv_sec;
+		starttime = mono_time.tv_sec;
+		dirsdone = 0;
 
 		/*
 		 * Push files whose dirty time has expired. Be careful
@@ -191,6 +194,7 @@
 					(void) VOP_FSYNC(vp, curproc->p_ucred,
 					    FSYNC_LAZY, 0, 0, curproc);
 					VOP_UNLOCK(vp, 0);
+					if(vp->v_type == VDIR) dirsdone++;
 				}
 				vn_finished_write(mp, 0);
 			}
@@ -240,8 +244,17 @@
 		 * matter as we are just trying to generally pace the
 		 * filesystem activity.
 		 */
-		if (time.tv_sec == starttime)
+		if (mono_time.tv_sec == starttime) {
+			if (dirsdone > dirburst)
+				dirburst += (dirsdone - dirburst) / 2;
+			if (dirburst > SYNCER_MAXBURST)
+				dirburst = SYNCER_MAXBURST;
 			tsleep(&rushjob, PPAUSE, "syncer", hz);
+		} else {
+			dirburst = dirburst / (mono_time.tv_sec - starttime);
+			if (dirburst < SYNCER_MINBURST)
+				dirburst = SYNCER_MINBURST;
+		}
 	}
 }
 
Index: miscfs/syncfs/syncfs.h
===================================================================
RCS file: /cvsroot/src/sys/miscfs/syncfs/syncfs.h,v
retrieving revision 1.6
diff -u -r1.6 syncfs.h
--- miscfs/syncfs/syncfs.h	6 Jan 2003 21:06:47 -0000	1.6
+++ miscfs/syncfs/syncfs.h	22 Mar 2004 14:57:58 -0000
@@ -55,6 +55,8 @@
 extern int (**sync_vnodeop_p) __P((void *));
 
 #define SYNCER_MAXDELAY       32
+#define SYNCER_MAXBURST	      1024
+#define SYNCER_MINBURST	      64
 
 extern int syncer_maxdelay;	/* maximum delay time */ 
 extern struct lock syncer_lock;	/* lock to freeze syncer during unmount */

--Qxx1br4bt0+wmkIi--