Subject: Re: more on mysql benchmark
To: SODA Noriyuki <soda@sra.co.jp>
From: Chuck Silvers <chuq@chuq.com>
List: tech-kern
Date: 03/08/2005 18:44:45
On Mon, Mar 07, 2005 at 12:16:29PM +0900, SODA Noriyuki wrote:
> >>>>> On Sun, 6 Mar 2005 17:38:09 -0800, Chuck Silvers <chuq@chuq.com> said:
> 
> > vm.filemin=5
> > vm.filemax=20
> > vm.anonmin=80
> > vm.anonmax=90
> > vm.execmin=5
> > vm.execmax=30
> 
> > this is probably closer to what various people think is appropriate
> > in general anyway.  comments?
> 
> I think vm.filemax + vm.anonmax + vm.execmax must be < 100,
> otherwise very nasty behaviour may happen.
> For example, please think about the following scenario:
> 
> - Continuous file access is ongoing.
>   And at the same time, anonymous pages take 90%.
> - The requirement for anonymous pages is increasing, and it reaches 91%.
> - Since the anonymous pages exceed vm.anonmax, the pages are no longer
>   protected by the vm.anonmax parameter.
> - On the other hand, file pages now become protected by vm.filemax,
>   because they are currently using only 10% and that's less than vm.filemax.
> - So, file pages now start to steal anonymous pages.
> 
> What we see in this scenario is that anonymous pages lose their physical
> memory when the requirement for anonymous pages increases.

in effect, an existing anon page is paged out and the memory is reused for
the new anon page.  why would this be undesirable?  this has to happen
at some point, we don't want to page out all file and exec pages before
we start reusing anon pages.


> I actually saw such a situation with the current default (anonmax=80 and
> filemax=50), and file pages actually took 50% of the memory at that
> time on a machine which had 128MB RAM.

sure.  but to prevent anon pages from being paged out and reused for file
pages, the knob you have to use is anonmin, not filemax.  hopefully the
description below will help.


> Having filemax+anonmax+execmax>=100 leaves no memory for the page daemon
> to adapt to the memory requirements of various situations; this is really
> problematic.
> 
> What I think more appropriate is something like:
> 	vm.{anon,exec,file}{min,max}={{10,80},{1,2},{0,1}}
> The rationales of this setting are:
> - Leave 17% of memory for the page daemon.
>   This may be still too small, and perhaps we should *decrease* (not
>   increase) anonmax.
> - For machines which have large RAM (e.g. 1GB), even 2% of memory
>   is enough for working sets of executable pages.
>   For machines which have really small RAM, working sets of executable
>   pages are often really really small (since such machines don't use
>   bloated software like mozilla). For example, what I saw in simon's
>   famous gcc benchmark on pc532 is that executable pages only take
>   1.5% of free+active+inactive pages.
> - vm.file{min,max} must be less than vm.exec{min,max}.
> 
> One drawback with the above setting is that sometimes the page daemon
> abandons too many vnode pages and makes too many free pages, but
> I guess that's because genfs_putpages() (or somewhere else?) does
> too much work, and I think that excess work should be fixed instead.

not much of the above makes sense to me, I think you misunderstand
what the sysctl tunables mean.

the *min tunables are the minimum percentage of physical memory that
is guaranteed to be available for each usage type.  so if we have
anon{min,max} = {10,80} (the current default), then we will never page out
anon pages as long as the percentage of memory currently in use for anon pages
is 10 or less.  once the percentage of memory used for anon pages is greater
than 10, then we might page out some anon pages, depending on additional
factors.

the *max tunables are harder to describe.  the idea was to give some notion
of how aggressive we should be at reclaiming pages of each usage type.
the exact implementation of this is that as long as the memory currently
in use for any usage type is over its max, we will only reclaim pages from
usage types whose usage is over their max values.  so continuing the previous
example, if anon usage is at 90% and the file and exec usage are below their
max, then we would only page out anon pages until anon usage is less than
80%.  then we would page out any type of page whose usage is greater than
its min value.
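
the two rules above can be sketched as a small model.  this is only an
illustration of the semantics as described here, not the pagedaemon code:
the function name is made up, anon{min,max} = {10,80} comes from the text,
and the file and exec values are assumed for the example.  "over its max"
is taken to mean strictly greater than.

```python
# illustrative model only, not the NetBSD pagedaemon code.
# all numbers are percentages of physical memory.

def reclaimable_types(usage, mins, maxs):
    """Return the usage types the pagedaemon may reclaim pages from."""
    over_max = {t for t in usage if usage[t] > maxs[t]}
    if over_max:
        # some type is over its max: only reclaim from types over their max
        return over_max
    # nobody is over max: reclaim from any type above its guaranteed min
    return {t for t in usage if usage[t] > mins[t]}


# anon{min,max} = {10,80} is from the text; the file and exec values
# are assumptions made for this example.
mins = {"anon": 10, "file": 10, "exec": 5}
maxs = {"anon": 80, "file": 50, "exec": 30}

# anon at 90% is over its max, so only anon pages are candidates
print(sorted(reclaimable_types({"anon": 90, "file": 7, "exec": 3}, mins, maxs)))

# anon back under 80%: every type above its min is a candidate,
# exec at 4% is below its min of 5% and stays protected
print(sorted(reclaimable_types({"anon": 75, "file": 15, "exec": 4}, mins, maxs)))
```

the first call yields only anon, the second yields anon and file.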

so any memory in excess of the sum of the min values is moved around by
the pagedaemon to adapt allocation to usage.
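
as a quick arithmetic check, here is how much memory each set of min
values from this thread leaves for the pagedaemon to move around.  the
helper name is made up for the example; the numbers are the ones quoted
above.

```python
def floating_percent(mins):
    """Percentage of memory not pinned by any *min guarantee."""
    return 100 - sum(mins.values())


# vm.*min values proposed at the start of this thread
proposed = {"file": 5, "anon": 80, "exec": 5}
# the min values from SODA's suggested settings
soda = {"anon": 10, "exec": 1, "file": 0}

print(floating_percent(proposed))  # 10
print(floating_percent(soda))      # 89

# on a 128MB machine, 10% is 12.8MB
print(128 * floating_percent(proposed) / 100)
```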

to respond to some of your other points:
 - the sum of the min values does need to be < 100% (and we enforce that).
 - it seems fine to have the sum of the max values be > 100%
   (though that does make the description of the semantics somewhat awkward).
 - I don't know why file{min,max} would want to have any specific relation to
   exec{min,max}.
 - it's ok to have the min value for (e.g.) exec pages be larger than what
   will actually be used in practice.  the extra pages will be used for
   other things unless and until the demand for exec pages increases.
 - I would think that the default filemin should be at least 5%, since that was
   the default minimum size of the buffer cache when we used it for file data.
   as I recall, it was 10% of memory up to some threshold, and then 5% of any
   additional memory.
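
two of the points above are simple enough to sketch.  both functions are
hypothetical helpers written for this example, not kernel code: the first
models the "sum of the min values < 100%" rule, the second models the old
buffer cache minimum as recalled above.  the threshold is left as a
parameter because its value is not given here, and the units are arbitrary.

```python
def mins_are_valid(filemin, anonmin, execmin):
    """True if the three min percentages leave some memory unreserved."""
    return filemin + anonmin + execmin < 100


def old_bufcache_min(mem, threshold):
    """10% of memory up to `threshold`, plus 5% of anything beyond it.

    `threshold` is a stand-in; the real value is not stated in this mail.
    """
    low = min(mem, threshold)
    return low * 10 // 100 + (mem - low) * 5 // 100


print(mins_are_valid(5, 80, 5))     # sums to 90, accepted
print(mins_are_valid(10, 85, 5))    # sums to 100, rejected

# made-up sizes: 1000 units of memory with a threshold of 200 units
print(old_bufcache_min(1000, 200))  # 20 + 40 = 60
```

as memory grows well past the threshold, the result approaches 5% of
memory, which is where the suggestion of a 5% filemin comes from.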


now I'm not saying that these semantics for VM tuning are the best ones,
or even that they're very good, but they're what we have today.  I chose
this set of knobs mostly because the implementation was relatively
straightforward and the policy evaluation itself wouldn't consume a lot of
CPU cycles.  if someone wants to suggest a better set of tunables,
that would be a fine discussion to have.


> BTW, I think it is better to decrease the current lower bound of
> vm.bufcache from 5% to something like 1% (Note that I'm not talking
> about the default value but the lower bound value here). Because it's often
> desirable to use such a small value on machines like pc532 which have
> smaller RAM.

sure, allowing the administrator to set the limit lower if they desire
would be fine.

-Chuck


> P.S. (not to chuck, but to the rest of the audience)
> Please note that the above discussion is only talking about default
> settings; there are situations where it's better to set vm.file{min,max}
> to larger values.
> --
> soda