Subject: Suggested fix for NetBSD 1.2 kernels
To: Laine Stump <>
From: Jonathan Stone <jonathan@DSG.Stanford.EDU>
List: tech-kern
Date: 09/17/1996 17:28:05

Yes, the fix works.   I've tested it under a much more aggressive
VM shortage than you'll see under most  real workloads.  I reported
that to this list.  The fix is adequate for most usage patterns.

For reference, a tidier version of the patch, which works better
on machines with clocks not at 100Hz (i.e., most things other
than i386es and sparcs), is appended at the end of this message.

Running something like the page-thrasher I posted *will* show up
performance anomalies, even with the fix.

The *code* in 4.4-Lite and 4.4-Lite2 is clearly buggy. Having the
pageout daemon sleep for a second when freepages are low, or even
totally exhausted, is just silly.  (I've said all along that that's
what I thought was happening, based on the disk drive sound; 
John Dyson took some convincing.)

On that basis, the fix below is a candidate for an Official Patch
for NetBSD 1.2.   It may, however, be too late to get the fix into the
kernels on official release media.  Building new kernels that are
shipped *with* the release addresses the problem adequately.
I undertake to  make the case for that.

*However*, (and this is addressed to tech-kern or even a more
technical audience) even with the fix, the design of the 4.4-Lite VM
system under mild-to-heavy paging load is flawed. (Okay, I'll say
it: it's *wrong*. No offense meant to Mike Hibler or anyone else.)

I justify that by observing that the performance under sustained heavy
loads is *still* poor. The pathological memory-toucher does not freeze
up entirely with the patch applied; but I'm still only seeing from 10%
to 50% of the page writeback rate that the memory-toucher was
sustaining, in its unfrozen periods, without the patch.

For those who care,  the bug is that there is *nothing* in the
code that enforces a minimum rate  at which pages on the inactive
list are actually cleaned and made free.  John Dyson's patch
fixes this by waking up the pager whenever pages are short.

(Wolfgang Solfrank) writes:

>Well, actually it's simpler than this. Without John's patch the sleep is for
>one second. With John's patch it sleeps till the next time that vm_pages_needed
>is wakeup'ed. This may be done by something signalling low free-pages condition.

Yes, I understood that.  That's why I quoted the function that was
doing the wakeup(). Sorry for the typo in the name.

The point I'm trying to get across is that
even with the fix applied, architecturally or philosophically, the
pager is doing the *WRONG THING* when it gets woken up.  Or rather,
it's not *DIRECTLY* doing the *right* thing, which is to make sure
some pages are freed up.  All it's doing is forcing more pages from
the active list to the inactive list, and thus (hopefully) causing
some of those pages to eventually get freed.

Running *without* John's patch, with a "systat -w 1 vmstat" already
running in a local (rlogin, not Xterm) window, is an excellent way to
demonstrate that.  During the freeze period, the pageout process is
being awakened once a second, and doing its thing.  But that doesn't
help at *all* until it's cleaned *all* the pages on the inactive list.
None of those pages actually get freed until *every* page on the
inactive list gets  written back.  My inactive list had 18Mbytes
of pages. 

The pager tries to start cleaning (i.e., initiates asynchronous
writeback) on at least cnt.v_free_min pages under memory-shortfall
conditions.  That's 64 pages, or 256Kbytes, on my systems. Assume the
inactive list is approximately 1/3 of physical memory.  That lets you
estimate how long the freezes will last; it's fairly accurate on my
systems.  [[note 1]]

What happens *with* John's patch applied is that the system spends "a
lot" of time moving pages onto the inactive list.  Under sustained VM
system load, even *with* John Dyson's patch applied, the free-list
*still* stays far too small (e.g., 12 free pages, total, for several
seconds).  The low number of free pages limits the rate at which
pagefaults can be sustained, and so limits the rate at which Useful
Stuff makes progress.  It also burns lots of cycles.  The patch
``works'' in that it avoids the freezes, but it's not doing quite
the  right thing.   That's the observation behind the comment I made
earlier,  which Wolfgang answered.

Perhaps it's fair to say that Wolfgang gave a correct, but
"shallow" answer, whereas I was looking for a "deep" answer.

As far as I can see, the sleep/wakeup synchronization between the
pageout daemon and swap_pager_iodone() can potentially cause
serialization of page cleaning, which is Not A Good Thing At All on a
system that doesn't support page clustering, and doesn't have hardware
with a deep write buffer (e.g., tagged SCSI command or on-drive
writeback buffers) to hide the write latency.

A more substantial reworking is to force the pageout daemon to
actually *clean* pages at some minimum rate, and to put those pages on
the free list immediately [see note 1], much as the pageout daemon
*already* enforces a minimum number of pages to be put on the inactive
list.  (What the pageout daemon currently does, even with the patch,
is not always enough to ensure either an adequate supply of free
pages, or good use of available backing-store write bandwidth, or of
CPU time.)

I'm happy to discuss this, privately or perhaps on tech-kern, with
people who have direct experience in one or more of implementation,
performance measurement, and tuning, of VM systems based on
active/inactive lists, like the VM code in 4.4Lite/NetBSD.
(The qualification is because it's hard enough even for John
 Dyson and me to be sure we're talking about the same thing.)

Index: vm_pageout.c
RCS file: /cvsroot/src/sys/vm/vm_pageout.c,v
retrieving revision 1.23
diff -c -r1.23 vm_pageout.c
*** vm_pageout.c	1996/02/05 01:54:07	1.23
--- vm_pageout.c	1996/09/18 00:14:45
*** 70,75 ****
--- 70,76 ----
  #include <sys/param.h>
  #include <sys/proc.h>
+ #include <sys/kernel.h>
  #include <vm/vm.h>
  #include <vm/vm_page.h>
*** 326,332 ****
  		 * shortage, so we put pause for awhile and try again.
  		 * XXX could get stuck here.
! 		(void) tsleep((caddr_t)&lbolt, PZERO|PCATCH, "pageout", 0);
  	case VM_PAGER_FAIL:
--- 327,333 ----
  		 * shortage, so we put pause for awhile and try again.
  		 * XXX could get stuck here.
! 		(void) tsleep((caddr_t)&vm_pages_needed, PZERO|PCATCH, "pageout", hz);
  	case VM_PAGER_FAIL:
*** 453,459 ****
  	if (postatus == VM_PAGER_AGAIN) {
  		extern int lbolt;
! 		(void) tsleep((caddr_t)&lbolt, PZERO|PCATCH, "pageout", 0);
  		goto again;
  	} else if (postatus == VM_PAGER_BAD)
  		panic("vm_pageout_cluster: VM_PAGER_BAD");
--- 454,461 ----
  	if (postatus == VM_PAGER_AGAIN) {
  		extern int lbolt;
! 		(void) tsleep((caddr_t)&vm_pages_needed, PZERO|PCATCH, "pageout", hz);
  		goto again;
  	} else if (postatus == VM_PAGER_BAD)
  		panic("vm_pageout_cluster: VM_PAGER_BAD");