Subject: Re: more on mysql benchmark
To: Chuck Silvers <chuq@chuq.com>
From: SODA Noriyuki <soda@sra.co.jp>
List: tech-kern
Date: 03/10/2005 06:21:06
>>>>> On Tue, 8 Mar 2005 18:44:45 -0800, Chuck Silvers <chuq@chuq.com> said:

>> > vm.filemin=5
>> > vm.filemax=20
>> > vm.anonmin=80
>> > vm.anonmax=90
>> > vm.execmin=5
>> > vm.execmax=30
>> 
>> > this is probably closer to what various people think is appropriate
>> > in general anyway.  comments?
>> 
>> I think vm.filemax + vm.anonmax + vm.execmax must be < 100,
>> otherwise very nasty behaviour may happen.
>> For example, please think about the following scenario:
>> 
>> - Continuous file access is ongoing.
>>   And at the same time, anonymous pages take 90%.
>> - The requirement for anonymous pages is increasing, and it reaches 91%.
>> - Since the anonymous pages exceed vm.anonmax, the pages are no longer
>>   protected by the vm.anonmax parameter.
>> - On the other hand, file pages now become protected by vm.filemax,
>>   because they are currently using only 10%, which is less than vm.filemax.
>> - So, file pages now begin to steal anonymous pages.
>> 
>> What we see in this scenario is that anonymous pages lose their physical
>> memory when the requirement for anonymous pages increases.

> in effect, an existing anon page is paged out and the memory is reused for
> the new anon page.

No. That is not what I saw.
As far as I can see, the paging-out doesn't stop, even when the ratio of
anonymous pages drops below vm.anonmax. It only stops when the ratio of
file pages almost exceeds vm.filemax.
I guess this is because the active page queue is reordered by
uvm_pageactivate(), which is called from uvmpd_scan_inactive().
So, the page daemon considers the file pages to be more active
than anonymous and executable pages. And as long as the memory shortage
condition continues, the page daemon keeps abandoning anonymous and
executable pages.

This is undesirable because:
- This page-queue-reordering throws away the history of page accesses.
- So, the decisions of the page daemon become really bad.
- The system suddenly behaves differently and discontinuously once that
  condition is reached.

>> I actually saw such a situation with the current defaults (anonmax=80
>> and filemax=50), and file pages actually took 50% of the memory at that
>> time on a machine which had 128MB RAM.

> sure.  but to prevent anon pages from being paged out and reused for file
> pages, the knob you have to use is anonmin, not filemax.

Yes, if one knows exactly how many anonymous, executable, and file pages
the system wants.
But the problem is that what we are talking about here is the default
values, so we don't actually know how many pages are needed.

I think I should describe my background more.
The following is what Simon and I saw 6 months ago.
(I really should have posted this result long ago. Sorry, esp. Simon.)

You may remember Simon's benchmark on pc532:
http://mail-index.netbsd.org/tech-perform/2002/11/27/0000.html

In the 1.5X days, this benchmark took 9:33.82 with the current default VM
tuning parameters. And it took only 2:06.78 with vm.anonmax=95.
So, anonmax=95 was considered better than anonmax=80 for this benchmark
on his pc532.
But that's not exactly correct, according to what Simon and I saw...

On NetBSD 2.0F/pc532, the benchmark took 44:52.74 (!) even with the same
compiler and the same VM parameters. So, we tried some tuning as below:

time (real)		      vm.bufmem_hiwater	vm.{anon,exec,file}{min,max}
44:52.74  74+ 67io 325664pf+1w	1228800 (15%)	{{10, 95},{ 5,30},{10,50}} [0]
14:52.89  40+ 68io  96290pf+1w	1228800 (15%)	{{10, 95},{ 5,30},{ 1, 2}} [1]
14:02.90  96+ 66io  89123pf+1w	 409600 ( 5%)	{{10, 95},{ 5,30},{ 1, 2}} [2]
13:26(*)  71+ 89io  83238pf+3w	 131072 ( 1.6%)	{{10, 95},{ 5,30},{ 1, 2}} [3]
 2:13.61  81+116io   4621pf+1w	 131072 ( 1.6%)	{{10, 95},{ 0, 1},{ 1, 2}} [4]
15:19.45  65+ 92io 102869pf+1w	 131072 ( 1.6%)	{{10, 95},{ 0,30},{ 1, 2}} [5]
 3:53.07  66+ 96io  16304pf+1w	 131072 ( 1.6%)	{{10,100},{ 0,30},{ 0, 1}} [6]
 2:13(*)			 131072 ( 1.6%)	{{10, 95},{ 0,15},{ 0, 1}} [7]
 1:49.73  77+ 97io   2868pf+1w	 131072 ( 1.6%)	{{10, 90},{ 0,15},{ 0, 1}} [8]
 1:56.12  55+ 94io   3436pf+1w	 131072 ( 1.6%)	{{10, 80},{ 0,15},{ 0, 1}} [9]
(*) not exact value.

According to these results:
- too large vm.file{min,max} makes the result much worse. - [0] vs [1]
- vm.bufcache=15 is a bit too large for this benchmark. - [1] vs [2]/[3]
- too large vm.execmax makes the result much worse, too. - [3]/[5] vs [4]/[7]
  Simon found that reducing vm.execmax from 20 to 15 seems to be
  about the turning point of this difference. - [5]/[6] vs [7]
- only making vm.anonmax=100 isn't enough to solve the problem of
  too large vm.execmax. - [6]
- vm.anonmax=95 is actually slower than vm.anonmax=80. - [4]/[7] vs [9]
  although the best result lies between [7] and [9],
  i.e. vm.anonmax=90 [8].

If vm.anonmin is large enough, we probably don't have to make
vm.{exec,file}{min,max} so small.

But to prevent the page-queue-reordering effect above, I want to make
vm.{exec,file}{min,max} rather small, and in that case vm.anonmin
doesn't have to be large, because vm.anonmax does the same job as
vm.anonmin whenever either executable pages or file pages exceed their
max and anonymous pages don't exceed vm.anonmax.
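
The rule described here can be sketched in a few lines. This is only an
illustration of the behaviour as described in this thread, not the code in
the page daemon itself; the function name protected(), the dictionary
layout and the exact comparison operators are assumptions made for the
example.

```python
def protected(usage, vmin, vmax):
    """Return which page types the page daemon would spare.

    usage, vmin and vmax map 'anon', 'exec' and 'file' to percentages.
    A type is spared if it is at or below its min, or if it is not over
    its own max while some other type is over its max.
    """
    under = {k: usage[k] <= vmin[k] for k in usage}
    over = {k: usage[k] > vmax[k] for k in usage}
    spared = {}
    for k in usage:
        other_over = any(over[o] for o in usage if o != k)
        spared[k] = under[k] or (not over[k] and other_over)
    return spared


# Chuck's proposed defaults from the top of this message.
vmin = {"anon": 80, "exec": 5, "file": 5}
vmax = {"anon": 90, "exec": 30, "file": 20}

# Anonymous usage grows from 90% to 91% and crosses vm.anonmax.
before = protected({"anon": 90, "exec": 1, "file": 9}, vmin, vmax)
after = protected({"anon": 91, "exec": 1, "file": 8}, vmin, vmax)
print(before)
print(after)
```

In this sketch file pages flip from unprotected to protected at the moment
anonymous pages cross vm.anonmax, which is the discontinuity in question.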

And using a smaller vm.anonmin lets the system adapt automatically to
more usage patterns.

>> Having filemax+anonmax+execmax>=100 leaves no memory for the page daemon
>> to adapt to the memory requirements of various situations, and this is
>> really problematic.
>> 
>> What I think more appropriate is something like:
>> vm.{anon,exec,file}{min,max}={{10,80},{1,2},{0,1}}
>> The rationales of this setting are:
>> - Leave 17% of memory for the page daemon.
>>   This may be still too small, and perhaps we should *decrease* (not
>>   increase) anonmax.
>> - For machines which have large RAM (e.g. 1GB), even 2% of memory
>>   is enough for the working sets of executable pages.
>>   For machines which have really small RAM, working sets of executable
>>   pages are often really really small (since such machines don't use
>>   bloated software like mozilla). For example, what I saw in simon's
>>   famous gcc benchmark on pc532 is that executable pages only take
>>   1.5% of free+active+inactive pages.
>> - vm.file{min,max} must be less than vm.exec{min,max}.
>> 
>> One drawback of the above setting is that the page daemon sometimes
>> abandons too many vnode pages and creates too many free pages, but
>> I guess that's because genfs_putpages() (or somewhere else?) does
>> too much work, and I think that excess work should be fixed instead.

> not much of the above makes sense to me, I think you misunderstand
> what the sysctl tunables mean.

Well, I don't think I misunderstand. ;-)
e.g. Isn't the following post of mine the same as what you said?
http://mail-index.netbsd.org/current-users/2004/09/01/0000.html

> to respond to some of your other points:
>  - it seems fine to have the sum of the max values be > 100%
>    (though that does make the description of the semantics somewhat awkward).

Under memory shortage, sum > 100% makes the page daemon throw away the
page access history through the page-queue-reordering effect.
That's one of the things that I'd like to avoid.
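
To make the arithmetic concrete, here is a small sketch that computes how
much memory is left uncovered by the three max values. The helper name
headroom() is made up for the example; the settings are the ones quoted
earlier in this message.

```python
def headroom(settings):
    """Return 100 minus the sum of the three max values, in percent."""
    return 100 - sum(vmax for _vmin, vmax in settings.values())


# (min, max) pairs, in percent.
chuck = {"anon": (80, 90), "exec": (5, 30), "file": (5, 20)}
soda = {"anon": (10, 80), "exec": (1, 2), "file": (0, 1)}

print("chuck:", headroom(chuck))
print("soda:", headroom(soda))
```

A negative result means the max values overlap, so two page types can both
be under their max while a third is over it.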

>  - I don't know why file{min,max} would want to have any specific
>    relation to exec{min,max}.

It's because the primary reason those VM parameters exist is to prevent
the famous UBC effect, i.e. file pages kicking anonymous and executable
pages out of physical memory.
So, we nearly always have to give executable (and anonymous) pages
priority over file pages.

>  - it's ok to have the min value for (eg.) exec pages to larger than what
>    will actually be used in practice.  the extra pages will be used for
>    other things unless and until the demand for exec pages increases.

It's not ok for machines which don't have enough RAM....

>  - I would think that the default filemin should be at least 5%, 
>    since that was the default minimum size of the buffer cache when
>    we used it for file data.

I don't think so, because vm.file{min,max}={0,1} usually doesn't make
file pages shrink below 5%.
The following is what I saw on the 128MB-RAM machine with
vm.{anon,exec,file}{min,max}={{10,80},{0,2},{0,1}}:

	anon exec file
	 31%  16%  50%
	 28%  15%  50%
	 32%  16%  52%
	 61%  15%  32%
	 74%  14%  20%
	 32%   3%  74%
	 35%   4%  70%
	 77%  15%  15%

It seems file pages still get plenty of memory even with the above
parameters.

>    as I recall, it was 10% of memory up to some threshold, and 
>    then 5% of any additional memory.

Yes, we used to use 10% of the first 2MB of memory, and 5% of the rest,
with a minimum of 16 buffers.
On modern machines, this is nearly the same as 5% of the physical memory.
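
As a quick check of that claim, the old sizing rule can be worked out for
a few memory sizes. This is only a sketch: the function name is made up,
and the 16-buffer minimum is left out because the buffer size isn't given
here.

```python
MiB = 1024 * 1024


def old_bufcache_bytes(physmem):
    """10% of the first 2MB plus 5% of the rest (16-buffer floor omitted)."""
    first = min(physmem, 2 * MiB)
    rest = physmem - first
    return first // 10 + rest // 20


for mb in (8, 128, 1024):
    mem = mb * MiB
    share = 100 * old_bufcache_bytes(mem) / mem
    print(f"{mb:5d}MB RAM: {share:.2f}% of physical memory")
```

The extra 5% on the first 2MB only matters on very small machines; with
128MB or more the share is within about a tenth of a percent of 5%.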

But in those days, data and metadata shared that 5% of the physical
memory. Now that data uses different memory from metadata, I think the
memory for the data (i.e. UBC file pages) can be less than 5%.

BTW, have you compared your proposal of the new default:
	vm.{anon,exec,file}{min,max}={{80,90},{5,30},{5,20}}
with your better sysctl settings for the MySQL benchmark?:
	vm.{anon,exec,file}{min,max}={{80,99},{5,30},{1,20}}

Also, is it possible to measure my proposal against it?
	vm.{anon,exec,file}{min,max}={{10,80},{0,2},{0,1}}
--
soda