Subject: Re: more on mysql benchmark
To: SODA Noriyuki <soda@sra.co.jp>
From: Chuck Silvers <chuq@chuq.com>
List: tech-kern
Date: 03/13/2005 05:46:21
hi,

(let me first ramble a bit with some ideas that hopefully clarify things,
then I'll respond to some specific points in your previous mail.)

the way I think about the sysctl vm.* knobs is that the min and max values
for a usage type specify lower and upper thresholds of a "normal" amount of
memory to be used.  if the current usage is below the min value then that
type is underweight, and if the current usage is above the max value then
that type is overweight.  as long as all three usage types are in the same
zone of this spectrum (all underweight, all normal, or all overweight) then
pages are examined for reference activity in the order they were created.
(well, if they are all underweight then there must be some free pages, so
the pagedaemon isn't being run, but you get the idea.)  however, if the
current usages are in different zones, then we will only examine pages of
the usage type(s) closest to the overweight end.  this will have the effect of
moving the balance back toward the three types all having "normal" usage.
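
to make that concrete, here's a rough userland sketch of the three-zone
idea (the names and structure here are invented for illustration, this is
not the actual pagedaemon code):

#include <stdio.h>

enum zone { UNDERWEIGHT, NORMAL, OVERWEIGHT };

struct usage {
        const char *name;
        int min, max;           /* vm.<type>min, vm.<type>max, in percent */
        int cur;                /* current usage, in percent */
};

static enum zone
classify(const struct usage *u)
{
        if (u->cur < u->min)
                return UNDERWEIGHT;
        if (u->cur > u->max)
                return OVERWEIGHT;
        return NORMAL;
}

int
main(void)
{
        /* example numbers only */
        struct usage types[3] = {
                { "anon", 80, 90, 93 },
                { "exec",  5, 30,  4 },
                { "file",  5, 20,  3 },
        };
        enum zone worst = UNDERWEIGHT;
        int i;

        /* find the zone closest to the overweight end */
        for (i = 0; i < 3; i++)
                if (classify(&types[i]) > worst)
                        worst = classify(&types[i]);

        /*
         * only the types in that zone are examined for reuse; the rest
         * are skipped, which pushes the balance back toward all three
         * types having "normal" usage.
         */
        for (i = 0; i < 3; i++)
                printf("%s: %s\n", types[i].name,
                    classify(&types[i]) == worst ? "examine" : "skip");
        return 0;
}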

another way to look at the *max tunables is that the max for each type is
really a min for the sum of the usage of the other two types.  e.g., to make
sure anon pages get some minimum amount of memory, use vm.anonmin.  to make
sure file and exec pages together get some minimum amount of memory, but
without specifying how much of each individually, use vm.anonmax (with
vm.anonmax=90, say, file and exec pages together are effectively guaranteed
about 10%).

ideally all of the pages in the system will contain valid data of one form
or another (cached file data if there isn't enough demand for anon pages),
so the total usage will often be 100% or close to that.  with current
defaults (and with the alternate settings that I suggested), 100% usage
will usually be with all three usage types in the normal zone, and the
pagedaemon will switch to the asymmetric consideration mode when one of the
types enters the overweight zone (or if pages are freed for some other
reason such as a process exiting or a file being truncated, which could put
a type into the underweight zone).

with your suggested setting of {{10,80},{0,2},{0,1}}, 100% usage will have
to have at least one usage type in the overweight zone, and most likely all
of them.  note that your settings are equivalent to {{80,80},{2,2},{1,1}}
since each usage type will stop being reclaimed as soon as it goes below
its max: at that point at least one of the other types will have to be above
its own max.  it's also equivalent to {{0,80},{0,2},{0,1}} and
{{80,100},{2,100},{1,100}}.  each of these basically collapses the
three-zone scheme to a two-zone scheme.
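
if you want to convince yourself of that, here's a quick brute-force check
using the same toy zone model as the sketch above (again an illustration,
not the real pagedaemon logic); at 100% total usage it finds no split of
memory where the two tunings disagree about which types to examine:

#include <stdio.h>

struct tune { int min, max; };

/* return a bitmask of the types that would be examined for reuse */
static int
examined(const struct tune t[3], const int cur[3])
{
        int i, mask = 0, any_over = 0;

        for (i = 0; i < 3; i++)
                if (cur[i] > t[i].max)
                        any_over = 1;

        for (i = 0; i < 3; i++) {
                if (any_over && cur[i] > t[i].max)
                        mask |= 1 << i;         /* only overweight types */
                else if (!any_over && cur[i] >= t[i].min)
                        mask |= 1 << i;         /* skip only underweight types */
        }
        return mask;
}

int
main(void)
{
        const struct tune a[3] = { {10,80}, {0,2}, {0,1} };
        const struct tune b[3] = { {80,80}, {2,2}, {1,1} };
        int anon, exec, cur[3], mismatch = 0;

        for (anon = 0; anon <= 100; anon++)
                for (exec = 0; anon + exec <= 100; exec++) {
                        cur[0] = anon;
                        cur[1] = exec;
                        cur[2] = 100 - anon - exec;     /* total usage is 100% */
                        if (examined(a, cur) != examined(b, cur))
                                mismatch++;
                }
        printf("splits where the tunings disagree: %d\n", mismatch);
        return 0;
}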

no matter what the settings are, the pagedaemon will start reordering the
page queues in the process of skipping types of pages when the usage is out
of balance, and it's not clear to me that this process has any greater or
lesser effect with any particular set of values.  I suspect that the
differences in resulting application performance are more due to changing
the number of pages we'll allow for each usage type rather than the order
in which pages of a given type are considered for reuse.  I'm not sure how
to confirm that experimentally, though.  some things to try might be to
compare the uvmexp.pdre{anon,file,exec} counters under different tunings,
or to add some code to reactivate some pages (instead of considering them
for reuse) using some criteria other than the usage balancing stuff and see
how that affects performance under various tunings.
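
something along these lines is how I'd read those counters from userland,
assuming the kernel exports struct uvmexp through the CTL_VM/VM_UVMEXP
sysctl and that the pdre{anon,file,exec} fields are present in your kernel
headers (untested, treat it as a sketch):

#include <sys/param.h>
#include <sys/sysctl.h>
#include <uvm/uvm_extern.h>

#include <stdio.h>

int
main(void)
{
        struct uvmexp ue;
        int mib[2] = { CTL_VM, VM_UVMEXP };
        size_t len = sizeof(ue);

        if (sysctl(mib, 2, &ue, &len, NULL, 0) == -1) {
                perror("sysctl");
                return 1;
        }
        /* sample these before and after a benchmark run under each tuning */
        printf("pdreanon %d  pdrefile %d  pdreexec %d\n",
            ue.pdreanon, ue.pdrefile, ue.pdreexec);
        return 0;
}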

the point of the freebsd-based patch from yamt (and the "generational"
scheme that a few people experimented with a few years back) is that the
single bit of access history that the mach-derived paging-queue system
maintains isn't nearly enough information to allow decent decisions based
on access patterns, so these schemes retain more history.  I believe the
main difference between these is that under the freebsd scheme, continued
accesses give a page a linear boost in retention priority, whereas under
the proposed generational scheme, continued accesses would give a page an
exponential boost.  either of these would mitigate the queue-reordering
effects of enforcing the usage-balance tunables.  so I guess implementing
one of these would also be a good way to see which effect of changing the
sysctl tunables is making more of a difference (and it seems like a good
improvement in any case).
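
as a toy model of the difference (the constants and names are invented;
this is neither the actual freebsd act_count code nor the proposed
generational code):

#include <stdio.h>

#define PRI_MAX 64

/* linear boost: each reference adds a fixed amount of retention priority */
static int
linear_scan(int pri, int referenced)
{
        pri += referenced ? 1 : -1;
        if (pri > PRI_MAX)
                pri = PRI_MAX;
        return pri < 0 ? 0 : pri;
}

/* exponential boost: each reference doubles the retention priority */
static int
exponential_scan(int pri, int referenced)
{
        if (referenced)
                pri = pri ? pri * 2 : 1;
        else
                pri /= 2;
        return pri > PRI_MAX ? PRI_MAX : pri;
}

int
main(void)
{
        /* a page that is referenced on each of 6 consecutive scans */
        int lin = 0, expo = 0, i;

        for (i = 1; i <= 6; i++) {
                lin = linear_scan(lin, 1);
                expo = exponential_scan(expo, 1);
                printf("scan %d: linear %2d  exponential %2d\n", i, lin, expo);
        }
        return 0;
}

either way, a page that keeps getting referenced accumulates enough history
that one pass of usage-balancing reordering doesn't immediately make it look
cold.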

more comments in-line below.


On Thu, Mar 10, 2005 at 06:21:06AM +0900, SODA Noriyuki wrote:
> >>>>> On Tue, 8 Mar 2005 18:44:45 -0800, Chuck Silvers <chuq@chuq.com> said:
> 
> >> > vm.filemin=5
> >> > vm.filemax=20
> >> > vm.anonmin=80
> >> > vm.anonmax=90
> >> > vm.execmin=5
> >> > vm.execmax=30
> >> 
> >> > this is probably closer to what various people think is appropriate
> >> > in general anyway.  comments?
> >> 
> >> I think vm.filemax + vm.anonmax + vm.execmax must be < 100,
> >> otherwise very nasty behaviour may happen.
> >> For example, please think about the following scenario:
> >> 
> >> - Continuous file access is ongoing.
> >>   And at the same time, anonymous pages take 90%.
> >> - The requirement for anonymous pages is increasing, and it reaches 91%.
> >> - Since the anonymous pages exceed vm.anonmax, the pages are now not
> >>   protected by the vm.anonmax parameter.
> >> - On the other hand, file pages now become protected by vm.filemax,
> >>   because they are currently using only 10%, which is less than vm.filemax.
> >> - So, file pages now start to steal anonymous pages.
> >> 
> >> What we see in this scenario is that anonymous pages lose their physical
> >> memory when the requirement for anonymous pages increases.
> 
> > in effect, an existing anon page is paged out and the memory is reused for
> > the new anon page.
> 
> No. That is not what I saw.
> As far as I see, the paging-out doesn't stop, even when the rate of
> the anonymous pages drops to less than vm.anonmax. It only stops when
> the rate of file pages almost exceeds vm.filemax.
> I guess this is because the active page queue is reordered by
> uvm_pageactivate() which is called from uvmpd_scan_inactive().
> So, the page daemon considers that the file pages are more active
> than anonymous and executable pages. And as long as the memory shortage
> condition continues, the page daemon abandons anonymous and executable
> pages.

whether or not anon pages continue to be paged out depends on what those
freed pages are reused for.  if they are reused for more anon pages, then
the balance is unchanged and we'll continue paging out only anon pages.
if the freed pages are used to hold file data, then pretty quickly
anon usage will be less than anonmax and both anon and file pages will
once again be eligible for reuse.


> This is undesirable because:
> - This page-queue-reordering abandons the history of page access.
> - So, the decision of the page daemon becomes really bad.
> - The system suddenly behaves differently and discontinuously at that
>   condition.

the history of page access is (supposedly) maintained more by the pmap
"referenced" bit than by the position in the paging queue.  cycling
through the paging queues more quickly will reduce the effectiveness of
that, but we do call pmap_clear_reference() before reactivating pages
due to usage-balancing, so they'll be reclaimed the next time around
unless they really are referenced again after this.  I guess your point
is that you believe this is still giving significant unfair preference
to pages that are reactivated due to usage-balancing.
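
to spell out the pattern I mean, here's a self-contained toy (the structure
and helpers are invented; this is not the real uvmpd_scan_inactive() or
pmap code):

#include <stdio.h>

struct toy_page {
        int referenced;         /* stands in for the pmap "referenced" bit */
        int spared_for_balance; /* usage-balancing says skip this page */
};

static void
consider_page(struct toy_page *pg)
{
        if (pg->spared_for_balance) {
                /*
                 * spared only because of its usage type: clear the
                 * reference history before reactivating, so the next
                 * pass can reclaim it unless it is touched again.
                 */
                pg->referenced = 0;     /* i.e. pmap_clear_reference() */
                printf("reactivate (reference bit cleared)\n");
                return;
        }
        if (pg->referenced) {
                pg->referenced = 0;
                printf("reactivate (recently referenced)\n");
                return;
        }
        printf("reclaim\n");            /* page out / free the page */
}

int
main(void)
{
        struct toy_page pg = { 1, 1 };

        consider_page(&pg);     /* first pass: spared, history cleared */
        pg.spared_for_balance = 0;
        consider_page(&pg);     /* next pass: looks unreferenced, reclaimed */
        return 0;
}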


> >> I actually saw such a situation with the current default (anonmax=80 and
> >> filemax=50), and file pages actually took 50% of the memory at that
> >> time on the machine which had 128MB RAM.
> 
> > sure.  but to prevent anon pages from being paged out and reused for file
> > pages, the knob you have to use is anonmin, not filemax.
> 
> Yes, if one has exact knowledge about how the system wants anonymous,
> executable, and file pages.
> But the problem is that what we are talking about here is the default
> values, so we don't actually know how many pages are needed.
> 
> I think I should describe my background more.
> The following is what Simon and I saw 6 months ago.
> (I really should have posted this result long ago. Sorry, esp. Simon.)
> 
> You may remember the following benchmark of Simon's on pc532:
> http://mail-index.netbsd.org/tech-perform/2002/11/27/0000.html

only vaguely, it's been a while.  :-)
(I've reread it now.)


> In the 1.5X days, this benchmark took 9:33.82 with the current default VM tuning
> parameters. And it took only 2:06.78 with vm.anonmax=95.
> So, it was considered that anonmax=95 was better than anonmax=80 in
> this benchmark on his pc532.
> But that's not exactly correct, according to what Simon and I saw...
> 
> On NetBSD 2.0F/pc532, the benchmark took 44:52.74 (!) even with the same
> compiler and the same VM parameters. So, we tried some tuning as below:
> 
> time (real)		      vm.bufmem_hiwater	vm.{anon,exec,file}{min,max}
> 44:52.74  74+ 67io 325664pf+1w	1228800 (15%)	{{10, 95},{ 5,30},{10,50}} [0]
> 14:52.89  40+ 68io  96290pf+1w	1228800 (15%)	{{10, 95},{ 5,30},{ 1, 2}} [1]
> 14:02.90  96+ 66io  89123pf+1w	 409600 ( 5%)	{{10, 95},{ 5,30},{ 1, 2}} [2]
> 13:26(*)  71+ 89io  83238pf+3w	 131072 ( 1.6%)	{{10, 95},{ 5,30},{ 1, 2}} [3]
>  2:13.61  81+116io   4621pf+1w	 131072 ( 1.6%)	{{10, 95},{ 0, 1},{ 1, 2}} [4]
> 15:19.45  65+ 92io 102869pf+1w	 131072 ( 1.6%)	{{10, 95},{ 0,30},{ 1, 2}} [5]
>  3:53.07  66+ 96io  16304pf+1w	 131072 ( 1.6%)	{{10,100},{ 0,30},{ 0, 1}} [6]
>  2:13(*)			 131072 ( 1.6%)	{{10, 95},{ 0,15},{ 0, 1}} [7]
>  1:49.73  77+ 97io   2868pf+1w	 131072 ( 1.6%)	{{10, 90},{ 0,15},{ 0, 1}} [8]
>  1:56.12  55+ 94io   3436pf+1w	 131072 ( 1.6%)	{{10, 80},{ 0,15},{ 0, 1}} [9]
> (*) not exact value.
> 
> According to these results:
> - too much vm.file{min,max} makes the result really worse. - [0] vs [1]
> - vm.bufcache=15 is a bit too large for this benchmark. - [1] vs [2]/[3]
> - too much vm.execmax makes the result really worse, too. - [3]/[5] vs [4]/[7]
>   simon found that reducing vm.execmax from 20 to 15 seems to be
>   about the turning point of this difference. - [5]/[6] vs [7]
> - only making vm.anonmax=100 isn't enough to solve the problem of
>   too much vm.execmax. - [6]
> - vm.anonmax=95 is actually slower than vm.anonmax=80. - [4]/[7] vs [9]
>   although the best result is in the middle of [7] and [9],
>   i.e. vm.anonmax=90 [8].
> 
> If vm.anonmin is large enough, we probably don't have to make
> vm.{exec,file}{min,max} such small values.
> 
> But to prevent the page-queue-reordering effect above, I want to make
> vm.{exec,file}{min,max} rather smaller, and in that case, vm.anonmin
> doesn't have to be large. Because vm.anonmax does the same job as
> vm.anonmin, if either executable pages or file pages exceed its max,
> and if anonymous pages don't exceed vm.anonmax.
> 
> And using a smaller vm.anonmin makes the system automatically adapt to
> more usage patterns.

I think my initial comments in this message explain what I think is
wrong with some of the conclusions you're drawing here.  I can go through
things more specifically if you still disagree.


> >> Having filemax+anonmax+execmax>=100 leaves no memory for the page daemon
> >> to adapt to memory requirements in various situations; this is really
> >> problematic.
> >> 
> >> What I think more appropriate is something like:
> >> vm.{anon,exec,file}{min,max}={{10,80},{1,2},{0,1}}
> >> The rationale for this setting is:
> >> - Leave 17% of memory for the page daemon.
> >>   This may still be too small, and perhaps we should *decrease* (not
> >>   increase) anonmax.
> >> - For machines which have large RAM (e.g. 1GB), even 2% of memory
> >>   is enough for working sets of executable pages.
> >>   For machines which have really small RAM, working sets of executable
> >>   pages are often really, really small (since such machines don't use
> >>   bloated software like mozilla). For example, what I saw in simon's
> >>   famous gcc benchmark on pc532 is that executable pages only take
> >>   1.5% of free+active+inactive pages.
> >> - vm.file{min,max} must be less than vm.exec{min,max}.
> >> 
> >> One drawback with the above setting is that sometimes the page daemon
> >> abandons too many vnode pages and leaves too many free pages, but
> >> I guess that's because genfs_putpages() (or somewhere else?) does
> >> too much work, and I think that excess work should be fixed instead.
> 
> > not much of the above makes sense to me, I think you misunderstand
> > what the sysctl tunables mean.
> 
> Well, I think I don't misunderstand. ;-)
> e.g. isn't the following post of mine the same as what you said?
> http://mail-index.netbsd.org/current-users/2004/09/01/0000.html

ok, I see we're on the same page up to a point.
I guess it's more that we are seeing different implications of the
underlying behaviour.


> > to respond to some of your other points:
> >  - it seems fine to have the sum of the max values be > 100%
> >    (though that does make the description of the semantics somewhat awkward).
> 
> Under memory shortage conditions, sum > 100% makes the page daemon
> abandon the page-access history due to the page-queue-reordering effect.
> That's one of the things that I'd like to avoid.

like I said earlier, the usage-balancing code will reorder the queues
regardless of what the tunables are set to.  I don't see how it's possible
to enforce any limits based on usage type without reordering the queues.
(it may turn out that if we retain additional access history ala freebsd,
then we don't need the usage-type stuff at all, but that seems doubtful.)


> >  - I don't know why file{min,max} would want to have any specific
> >    relation to exec{min,max}.
> 
> It's because the primary reason for the existence of those VM parameters is
> to prevent the famous UBC effect, i.e. file pages kicking anonymous and
> executable pages out of physical memory.
> So, we nearly always have to give executable (and anonymous) pages
> priority over file pages.

yes, but merely setting execmin (or in your scheme, execmax) to be non-zero
guarantees a certain amount of memory for executable pages, regardless of
what the tunables for file pages are set to.  so why would it be necessary
that the amount of memory guaranteed to be available for exec pages be
greater than the amount of memory guaranteed to be available for file pages?


> >  - it's ok to have the min value for (e.g.) exec pages be larger than what
> >    will actually be used in practice.  the extra pages will be used for
> >    other things unless and until the demand for exec pages increases.
> 
> It's not ok for machines which don't have enough RAM....

well, if the machine doesn't have enough RAM then the min value for
exec pages will not be larger than what is needed (or at least I think
that's what you mean by "doesn't have enough RAM").  I think we're
talking past each other a bit on this point.


> >  - I would think that the default filemin should be at least 5%, 
> >    since that was the default minimum size of the buffer cache when
> >    we used it for file data.
> 
> I don't think so, because usually vm.file{min,max}={0,1} doesn't make
> file pages smaller than 5%.
> The following is what I saw on the 128MB-RAM-machine with
> vm.{anon,exec,file}{min,max}={{10,80},{0,2},{0,1}}:
> 
> 	anon exec file
> 	 31%  16%  50%
> 	 28%  15%  50%
> 	 32%  16%  52%
> 	 61%  15%  32%
> 	 74%  14%  20%
> 	 32%   3%  74%
> 	 35%   4%  70%
> 	 77%  15%  15%
> 
> It seems file pages are rather active even with the above parameters.

with whatever workload you were running when you collected those numbers,
those settings didn't cause a problem.  my point is that there will be
other common workloads where those settings will cause a problem.


> >    as I recall, it was 10% of memory up to some threshold, and 
> >    then 5% of any additional memory.
> 
> Yes, we used to use 10% of the first 2MB of memory, and 5% of the rest,
> with a minimum of 16 buffers.
> On modern machines, this is nearly the same as 5% of the physical memory.
> 
> But in those days, both data and metadata shared that 5% of the physical
> memory, and because data now uses different memory from metadata,
> I think the memory for the data (i.e. UBC file pages) can be less
> than 5%.

well, that depends on whether what wants to be cached is data or metadata.

I think we don't really want to differentiate file system data vs. metadata
for purposes of competing with process (anon/exec) pages, but we do want
to guarantee that there is always some memory available for metadata, since
we need to access indirect blocks in order to write data pages.  I'd rather
avoid making the usage-balancing scheme more complicated if we can help it,
though.

another idea I looked at long ago was caching the specific metadata required
for writing a data page along with the page somehow, so that we would not
need to access buffers for indirect blocks in order to write data pages
back to disk.  then we wouldn't need to guarantee that any memory be
available specifically for metadata buffers.  that got messy pretty quickly
though, and forcing file systems to stash bits and pieces of their metadata
all over the place might be even more complicated, since that would also
force them to know how to update or invalidate it.


> BTW, have you compared your proposal of the new default:
> 	vm.{anon,exec,file}{min,max}={{80,90},{5,30},{5,20}}
> with your better sysctl settings for the MySQL benchmark?:
> 	vm.{anon,exec,file}{min,max}={{80,99},{5,30},{1,20}}
> 
> Also, is it possible to measure my proposal against it?
> 	vm.{anon,exec,file}{min,max}={{10,80},{0,2},{0,1}}

I haven't had a chance to do any more runs yet, and I won't get more time
until next weekend.  but I'll try them then.

-Chuck