Subject: Re: memory tester shows up swap/page tuning bug [was Re: BUFFERCACHE, PR 1903]
To: None <Havard.Eidnes@runit.sintef.no, jmarin@pyy.jmp.fi>
From: Sean Doran <smd@cesium.clock.org>
List: current-users
Date: 09/14/1996 16:48:07
| I've read Sean's message. It looks like it will only postpone the
| problem, and only for some memory-usage patterns.

Well, there's no question that if you're doing heavy paging,
you lose.  However, in the wonderful days of Edition VII Workbench
running on a machine with a whole 640k of RAM and supporting too
many users, one could see exactly the same sort of "everything
stops when swapping is happening" thing when running a pipeline
of big processes, as one did when converting pic/tbl/grap/(sq)troff
things into a compressed PostScript file.

Historically, if you swap out an entire process, you can lose
on disk access time, and this worsens the hotter your swap-based
VM is.  That is, most swapping regimes haven't been
very efficient at handling large sequential reads and writes,
and there is a risk that swapping out a particularly
large process will become user-visible.

Moreover, if one ends up swapping out a particularly large process
that then is quickly awakened, you can get a user-visible pause as what
was swapped out is swapped back in again.

Unfortunately, as I am no longer with my previous employer,
I don't have easy access to a full-blown with-source NetBSD
system of my own to play with, but I have a hypothesis.
Essentially, it's that the swapper will suspend most
system activity until the process it's swapping to disk
is fully swapped out, and moreover, that a swapped-out process
is brought back into memory in its entirety rather than
demand-paged.  Furthermore, the amount of time a large
swapped-out process spends in the swapped-out state is short
compared to the amount of time the swapping operation takes.

If swapping slows down or stops everything else, then not swapping
might be a reasonable approach, provided the paging system
doesn't get stuck in circular working-set dependencies (i.e.,
active processes stealing pages from the working-sets of other
active processes).  These could be dealt with by some goo which
notices really heavy paging activity and puts paging-intensive
processes to sleep until the VM cools down.
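
As a strawman, the goo I have in mind would be shaped something
like the C sketch below.  Every name, counter, and threshold in
it is invented for the sake of illustration; it is not a patch,
and it certainly isn't a claim about how the existing pager is
written.

	/* Invented names throughout; a policy sketch, not kernel code. */
	#define FAULTS_HOT	500	/* faults/sec we call "really heavy" */
	#define THROTTLE_TICKS	100	/* how long to park an offender */

	struct goo_proc {
		int	recent_faults;	/* faults charged to it lately */
		int	asleep_until;	/* tick at which it may run again */
	};

	/*
	 * Called periodically: if the system as a whole is faulting hard,
	 * park the heaviest-faulting process for a while so the pager can
	 * cool the VM down, instead of swapping anybody out.
	 */
	void
	cool_down_vm(struct goo_proc *p, int nprocs, int fault_rate, int now)
	{
		int i, worst = -1;

		if (fault_rate < FAULTS_HOT)
			return;			/* VM isn't hot; do nothing */

		for (i = 0; i < nprocs; i++)
			if (worst < 0 ||
			    p[i].recent_faults > p[worst].recent_faults)
				worst = i;

		if (worst >= 0)
			p[worst].asleep_until = now + THROTTLE_TICKS;
	}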

The swapper is essentially that goo; however, it has its
own historical baggage, which could well include requiring
a swapped-out process to be fully swapped back in rather than
being left to the tender mercies of the pager once it is
declared unswapped.

If someone could verify that, that'd be neat.

At any rate, assuming I'm not wrong, people might
want to consider doing the following:

	-- making the swapper be invoked iff the pager
	   is causing the VM system to be too hot
	   (defer swapping; see the sketch after this list)
	-- suspending the swapper if it is hogging the CPU
	   when performing a swap operation (thus avoiding
	   the "other things don't run when things are swapping"
 	   problem)
	-- ensuring that the swap space can be used with 
	   efficiency approaching ccd/FFS for big sequential
	   operations
	-- keeping swapped-out processes swapped-out with
	   time proportional to the amount of time the
	   swapper would need to swap the process in, or
	   letting a swapped-out process be brought back into
	   physical memory through demand-paging (although this might
	   not be the best choice for some architectures and
	   I/O systems)
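
For the first item on that list, the gate I have in mind is
roughly the following.  The names and thresholds are made up,
and whether anything shaped like this fits the real swapper is
exactly the thing that needs verifying:

	/* Invented names; a sketch of "defer swapping", not real code. */
	#define PAGEOUT_HOT	200	/* pageouts/sec beyond which paging
					 * alone clearly isn't coping */

	/*
	 * Only fall back to swapping out a whole process when the pager
	 * has been running flat out and the system is still short of
	 * free pages; otherwise leave the shortage to the pager.
	 */
	int
	should_swap_out(int free_pages, int free_target,
	    int pageout_rate, int victim_ticks_asleep, int maxslp)
	{
		if (free_pages >= free_target)
			return 0;	/* no shortage at all */
		if (pageout_rate < PAGEOUT_HOT)
			return 0;	/* pager is coping; defer swapping */
		if (victim_ticks_asleep < maxslp)
			return 0;	/* don't swap anything lively */
		return 1;		/* last resort: swap it out */
	}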

An extension of the first bullet might be to have two variants
of what's done with MAXSLP, specifically a multiplier that marks
a process as very pageable and a multiplier that marks a process
as very swappable.

If the VM is short of core and the clock algorithm isn't 
helping, the pager might consider stealing pages from very 
pageable processes instead of invoking the swapper.   
(These processes would have to have been asleep for
a "long time" (MAXSLP * n) anyway, so this seems reasonable,
but this is also the logic behind swapping these processes
out altogether, and it shares similar scaling flaws, minus
the possible breakage of requiring long stretches of
uninterruptible time to migrate entire process spaces
to and from swap space.)
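
A sketch of what those two thresholds might look like, with the
multipliers, names, and numbers all invented purely to show the
shape of the idea:

	/* Invented illustration of two MAXSLP-derived thresholds. */
	#define MAXSLP		20	/* seconds; the traditional knob */
	#define PAGEABLE_MULT	2	/* asleep 2*MAXSLP: raid its pages */
	#define SWAPPABLE_MULT	8	/* asleep 8*MAXSLP: swap candidate */

	enum vm_victim { VM_LEAVE_ALONE, VM_VERY_PAGEABLE, VM_VERY_SWAPPABLE };

	/*
	 * Classify a sleeping process.  When the clock algorithm isn't
	 * freeing enough, the pager would steal from VM_VERY_PAGEABLE
	 * processes first, and only hand VM_VERY_SWAPPABLE ones to the
	 * swapper.
	 */
	enum vm_victim
	classify(int seconds_asleep)
	{
		if (seconds_asleep >= SWAPPABLE_MULT * MAXSLP)
			return VM_VERY_SWAPPABLE;
		if (seconds_asleep >= PAGEABLE_MULT * MAXSLP)
			return VM_VERY_PAGEABLE;
		return VM_LEAVE_ALONE;
	}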

I'm also curious to know whether the swapper moves
shared pages (particularly text pages) to and from swap
when invoked.  I'd hope not, but it's worth a look, since
some of the I/O-time vs CPU time tradeoffs in the Berkeley
VM are interesting.

| The following shar'ed program allocates and touches large amounts of
| physical memory.

This should be committed as /etc/chill. :)

| IMNSO, a VM system should keep a pool of free pageframes at all times,
| to satisfy pagefaults without having to first page something else out.

The problem is that when there is a large short-term demand
for memory, one has a couple of obvious options:

	-- spend lots of CPU in the pager, accelerating the
	   clock hands and getting rid of "LRU" pages
or	-- spend much less CPU by swapping a process that
	   has lots of memory in use and which hasn't run
	   very recently

The current VM does the latter.  This might even be the
better of the two approaches on all but very crunchy CPUs,
provided that very big swap operations don't block the
operation of other processes.
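
Spelled out, the choice between those two options looks roughly
like the sketch below.  The names are made up, and the shape of
the real code is precisely what I can't check at the moment, so
treat it as a picture of the tradeoff rather than of the kernel:

	/* Made-up names; a picture of the two options, not kernel code. */
	struct big_proc {
		int	resident_pages;		/* memory it's holding */
		int	ticks_since_run;	/* how stale it is */
	};

	/*
	 * Faced with a large short-term demand for pages: either keep
	 * cranking the clock hands (lots of CPU, fine-grained), or pick
	 * one large, long-idle process and swap it out wholesale (cheap
	 * in CPU, but one big burst of I/O).
	 */
	void
	relieve_shortage(int pages_needed, struct big_proc *stale,
	    int stale_threshold,
	    int (*run_clock_hands)(int),	/* returns pages freed */
	    void (*swap_out)(struct big_proc *))
	{
		if (stale && stale->ticks_since_run > stale_threshold &&
		    stale->resident_pages >= pages_needed) {
			swap_out(stale);	/* cheap CPU, big I/O */
			return;
		}
		for (;;) {			/* spend CPU in the pager */
			int freed = run_clock_hands(pages_needed);

			if (freed <= 0 || (pages_needed -= freed) <= 0)
				break;
		}
	}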

Maintaining a pool of memory to satisfy large short-term demand
doesn't seem attractive at first glance, since it reduces
available memory in conditions where the set of active 
pages would fit in physical memory if the spare pool wasn't
there.   To me, this looks like forcing lots of really unnecessary
paging to work around what appears to be a bug in the swapper.

| It *may* (emphasis on the "may") be that NetBSD is failing to do this
| in some circumstances.  I don't really know how to diagnose this,
| when the user-level tools I'd normally use are also getting stomped.

Use kgdb and stare at the same counters that systat stares at
when it reports the memory figures in the top left of the
vmstat display?

Anyway, as I said, I'd appreciate any code-staring people
could do that would prove or disprove my hypotheses.

Finally, if anybody playing with unified VM/buffer-cache
aging code in other places has insights about whether there
are similar problems in those implementations, I'd be
interested in hearing them.

	Sean.