tech-kern: Re: memory tester shows up swap/page tuning bug [was Re: BUFFERCACHE,

Subject: Re: memory tester shows up swap/page tuning bug [was Re: BUFFERCACHE,
To: Wolfgang Solfrank <ws@kurt.tools.de>
From: Jonathan Stone <jonathan@DSG.Stanford.EDU>
List: tech-kern
Date: 09/15/1996 17:26:57
[[migrating a thread about machines freezing under heavy VM load
 from current-users to tech-kern; please direct replies appropriately.
 This would go in a NetBSD PR, but they seem to be down at the moment. ]]


>Actually, the VM system does keep a pool of free pageframes at all times, at
>least it tries to. See above.

In the circumstances in question, it's not even trying very hard.


My best guess as to what's going on is this:

	* Free pages are completely exhausted. (the free-page
	  count, as shown by systat's vmstat display, is blank,
	  and therefore zero.)

	* cnt.v_free_count is less than the desired "cnt.v_freetarget",
	  forcing the pager to scan.  I see this when running a memory
	  hog on a machine with 64 Mbytes of memory, 2/3 of which is
	  active and 1/3 of which is inactive.

	* Because there are no free pages left, scanning is
	  activated, perhaps from vm_pageout()?  This scanning forces
	  cleaning of at least min_free pages (per second).  These
	  pages are put in the laundry (asynchronous writeback) but
	  *are not* put on the free list.


	* daefr (v_dfree from vmmeter) is zero (not shown by systat vmstat)
	  during the entire freeze.  This proves that, although pages are
	  being shown as paged out at a rate equal to the scan rate, the
	  pages are only being cleaned by being written back, not actually
	  freed.

	  In other words, during the freeze periods, *nothing at
	  all* is being done, except a quiet writeback of dirty pages;
	  64 pages/sec in my case.

	  I have verified that during the "freeze" periods, the disk I/O
	  rate, the scan rate, and the "pageout" rate are all equal
	  on my machine.  Frequently, none of the active, inactive, or
	  free page counts *change at all* during the freeze periods.


	* Eventually, *something* wakes up and forces the now-cleaned
	  pages to be put on the free list.
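
The signature described in the list above can be written down as a
predicate over one second's worth of counters.  This is only a sketch:
the struct and its field names are hypothetical (they are not the
vmmeter names), and it checks nothing beyond the symptoms listed above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical one-second sample of the counters discussed above. */
struct vm_sample {
	long free_pages;	/* free-page count (blank in systat == 0) */
	long scan_rate;		/* pages scanned per second */
	long pageout_rate;	/* pages "paged out" per second */
	long disk_io_rate;	/* disk transfers per second */
	long dfree_rate;	/* pages actually freed by the daemon per second */
};

/*
 * Returns 1 if the sample matches the freeze signature: no free pages,
 * nothing being freed, and scan, pageout and disk I/O rates all equal
 * and non-zero (i.e. only writeback is happening).
 */
static int
looks_frozen(const struct vm_sample *s)
{
	assert(s != NULL);
	return s->free_pages == 0 &&
	    s->dfree_rate == 0 &&
	    s->scan_rate > 0 &&
	    s->scan_rate == s->pageout_rate &&
	    s->pageout_rate == s->disk_io_rate;
}
```

With the numbers from my machine (no free pages, 64 pages/sec scanned,
paged out and written, nothing freed) looks_frozen() returns 1; as soon
as dfree_rate becomes non-zero it returns 0.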


My current guess is that the cause of the wakeup is that the entire
inactive list has to be cleaned before the pageout daemon finds any
clean pages. Once this is done (64 pages at a time!), then
vm_pageout_scan() finally finds some clean pages and frees them.
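
As a rough sanity check on that guess, the arithmetic can be sketched
as below.  The 4096-byte page size and the assumption that every
inactive page is dirty are mine, purely for illustration; the helper
names are hypothetical as well.

```c
#include <assert.h>

/* Number of pages in `mbytes` megabytes of memory (hypothetical helper). */
static long
pages_in(long mbytes, long page_size)
{
	assert(mbytes > 0 && page_size > 0);
	return mbytes * 1024L * 1024L / page_size;
}

/* Seconds to launder `npages` dirty pages at `rate` pages/sec, rounded up. */
static long
seconds_to_clean(long npages, long rate)
{
	assert(npages >= 0 && rate > 0);
	return (npages + rate - 1) / rate;
}
```

Under those assumptions, a 64 Mbyte machine has pages_in(64, 4096) =
16384 pages, about 5461 of them inactive, and
seconds_to_clean(5461, 64) comes to 86 seconds.  That is at least the
same order as the freezes of over a minute, though it proves nothing
about the cause.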


The one thing in this analysis that doesn't hold water is how
vm_pageout_scan() can be running enough to initiate cleaning on 64
pages per second, yet manage to not free *any* pages (as shown by
systat's vmstat display) for over 60 seconds.  (i.e., how does it
get out of the loop?) Yet that's exactly what I'm seeing.


Since the point has been missed at least once, I'll repeat it:
NetBSD's VM system has a bug, where free pages get totally exhausted
causing the system to lock up, or freeze, for several seconds to over
a minute.  The system is doing essentially *nothing* during this
entire time, except a very low rate (I'd call it a background rate)
of cleaning pages by writing them to disk.  The active, inactive,
and free page counts can remain constant during an entire freeze
(depending on the behaviour of other processes lucky enough to have
all their pages in memory).   The freeze persists for long enough for
the background-rate page cleaning to clean the entire active list;
this may be a coincidence or may be causally linked to subsequent behaviour.

For whatever reason, at some point the system decides to *free* some
pages, and unfreezes, until the pages cleaned during the freeze are
all re-used and the freelist is, again, exhausted.  At that point, the
cycle repeats.

Charles, is it at all possible that the TAILQ changes to vm_pageout.c
are causing a pathological ordering of the active list?  I'm not
suggesting this, I'd just like to rule it out as a possibility.

I've looked at the 4.4-Lite2 VM changes and nothing leaps out as being
related to this.  Does anyone (Mike Hibler, perhaps, or any of the
FreeBSD people) recognize these symptoms at all?  It's a very annoying
bug, to say the least....