Subject: Re: more on mysql benchmark
To: SODA Noriyuki <soda@sra.co.jp>
From: Chuck Silvers <chuq@chuq.com>
List: tech-kern
Date: 03/20/2005 18:13:39
On Mon, Mar 14, 2005 at 12:45:37AM +0900, SODA Noriyuki wrote:
> > with your suggested setting of {{10,80},{0,2},{0,1}}, 100% usage will have
> > to have at least one usage type in the overweight zone, and most likely all
> > of them.
> 
> Yes.
> One of my intentions is to always put file pages in the overweight
> zone, because even with file{min,max}={0,1}, file pages often occupy
> too much physical memory (at least on my machine with 128MB RAM).
> The other intention is to stop the problem where page-queue reordering
> causes anonymous memory to be paged out.

I'm not sure I've gotten across the point I've been trying to make about
this.  I agree that throwing away the access history is bad, especially
since the current system maintains so little history in the first place,
but the usage-type tunables have only an indirect effect on access history.
the direct effect of these tunables is to keep pages of some usage types
in memory (even though the access history indicates they should be reused)
when the usage pattern is outside the desired range indicated by the tunables.
reordering the page queue is the mechanism whereby we keep those pages
in memory.  in the current scheme, for the tunables to have any effect
at all, the page queues will be substantially reordered.
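
to make that concrete, here's a rough userland sketch of the kind of check
I mean.  the rule and all of the names here are my simplification for
illustration, not the actual uvm_pdaemon.c logic: a page gets "reactivated"
(pulled out of queue order regardless of its access history) when its usage
type is under that type's min, or when it is not over its own max while
some other type is over its max.

/*
 * toy model of the usage-balancing decision -- illustration only,
 * not the code in uvm_pdaemon.c.
 */
#include <stdbool.h>
#include <stdio.h>

enum utype { UT_ANON, UT_EXEC, UT_FILE, UT_NTYPES };

struct tunables {
        int min[UT_NTYPES];     /* percent of managed memory */
        int max[UT_NTYPES];
};

static bool
would_reactivate(const struct tunables *tn, const int pct[UT_NTYPES],
    enum utype t)
{
        if (pct[t] < tn->min[t])
                return true;            /* underweight: protect it */
        if (pct[t] > tn->max[t])
                return false;           /* overweight: fair game */
        for (int o = 0; o < UT_NTYPES; o++)
                if (o != t && pct[o] > tn->max[o])
                        return true;    /* another type is overweight */
        return false;                   /* plain queue order applies */
}

int
main(void)
{
        /* {{10,80},{0,2},{0,1}} */
        struct tunables tn = { .min = { 10, 0, 0 }, .max = { 80, 2, 1 } };
        /* one snapshot from your 128MB machine: anon 31%, exec 16%, file 50% */
        int pct[UT_NTYPES] = { 31, 16, 50 };
        const char *name[UT_NTYPES] = { "anon", "exec", "file" };

        for (int t = 0; t < UT_NTYPES; t++)
                printf("%s: %s\n", name[t],
                    would_reactivate(&tn, pct, (enum utype)t) ?
                    "reactivated (queue reordered)" : "left in queue order");
        return 0;
}

with those numbers and that snapshot, only the anon pages come out protected
under this simplified rule; exec and file pages are both over their maxes and
compete head-on, which is the competition I come back to below.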

why do you think that your set of tunings reorders the page queue less than
any of the others we've considered?  do you have any evidence of this?

I'm suggesting that the reason your tunable settings have a good effect on
the pc532-machdep.o benchmark is because they closely match the actual
usage-type needs of this particular benchmark, rather than because they
reorder the page queues to a greater or lesser degree than other settings.


> > the history of page access is (supposedly) more maintained by the pmap
> > "referenced" bit than by the position in the paging queue.  cycling
> > through the paging queues more quickly will reduce the effectiveness of
> > that, but we do call pmap_clear_reference() before reactivating pages
> > due to usage-balancing, so they'll be reclaimed the next time around
> > unless they really are referenced again after this.  I guess your point
> > is that you believe this is still giving significant unfair preference
> > to pages that are reactivated due to usage-balancing.
> 
> Yes, that's my point.

ok, then how about we use a different mechanism than reactivating pages
to avoid considering them for re-use?  we could move them to new
anon/exec/file queues that would not be looked at by the pagedaemon loop.
at the start of a pagedaemon run, we could move any pages from the new
queues for the usage types that the pagedaemon will actually reuse
back to the front of the inactive list so that they will be examined first.
this would allow the pagedaemon to skip the pages that it won't reclaim
without giving them any unfair preference later.
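
here's a rough userland sketch of that scheme (all of the names are invented
for the illustration; this is not UVM code).  pages the pagedaemon declines
to reclaim are parked on per-usage-type "hold" queues that the scan loop
never examines, and at the start of the next run the hold queues for the
usage types that will be reclaimed this time are spliced back onto the front
of the inactive queue:

/*
 * toy model of parking skipped pages on side queues instead of
 * reactivating them.
 */
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

enum utype { UT_ANON, UT_EXEC, UT_FILE, UT_NTYPES };

struct page {
        int id;
        enum utype type;
        struct page *next;
};

struct pagequeue {
        struct page *head, *tail;
};

static void
pq_insert_tail(struct pagequeue *q, struct page *pg)
{
        pg->next = NULL;
        if (q->tail != NULL)
                q->tail->next = pg;
        else
                q->head = pg;
        q->tail = pg;
}

static struct page *
pq_remove_head(struct pagequeue *q)
{
        struct page *pg = q->head;
        if (pg != NULL) {
                q->head = pg->next;
                if (q->head == NULL)
                        q->tail = NULL;
                pg->next = NULL;
        }
        return pg;
}

/* splice "src" onto the front of "dst", preserving order; empties "src" */
static void
pq_prepend(struct pagequeue *dst, struct pagequeue *src)
{
        if (src->head == NULL)
                return;
        src->tail->next = dst->head;
        if (dst->head == NULL)
                dst->tail = src->tail;
        dst->head = src->head;
        src->head = src->tail = NULL;
}

static struct pagequeue inactive;
static struct pagequeue hold[UT_NTYPES];        /* the new side queues */

/* one simplified pagedaemon run; reclaim[t] says whether type t is fair game */
static void
pagedaemon_run(const bool reclaim[UT_NTYPES])
{
        struct page *pg;

        /* held pages of now-reclaimable types go back to the front first */
        for (int t = 0; t < UT_NTYPES; t++)
                if (reclaim[t])
                        pq_prepend(&inactive, &hold[t]);

        while ((pg = pq_remove_head(&inactive)) != NULL) {
                if (!reclaim[pg->type]) {
                        /* park it instead of reactivating it */
                        pq_insert_tail(&hold[pg->type], pg);
                        continue;
                }
                printf("reclaim page %d (type %d)\n", pg->id, pg->type);
                free(pg);
        }
}

int
main(void)
{
        for (int i = 0; i < 6; i++) {
                struct page *pg = calloc(1, sizeof(*pg));
                pg->id = i;
                pg->type = (enum utype)(i % UT_NTYPES);
                pq_insert_tail(&inactive, pg);
        }
        /* first run reclaims only file pages; anon and exec pages get parked */
        pagedaemon_run((const bool [UT_NTYPES]){ false, false, true });
        /* second run reclaims everything; parked pages come back first, in order */
        pagedaemon_run((const bool [UT_NTYPES]){ true, true, true });
        return 0;
}

the point of pq_prepend() is that the parked pages come back at the front of
the inactive queue in their original order, so they get examined first the
next time their type is reclaimable, but they pick up no lasting preference
from having been skipped.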


> As far as I can see, file pages tend to continue to grow at least up
> to vm.filemax, and often beyond it, under continuous file access.

well, by the definition of these tunables, it's ok for usage of a given type
to grow up to the max for that type as long as no other usage types are over
their max values.  you seem to be disagreeing with the design more than
with the tunable settings.  (which is fine, but if the design is really the problem
then let's talk about that directly instead of trying to fix the design by
changing the tunable settings.)


> >> > to respond to some of your other points:
> >> >  - it seems fine to have the sum of the max values be > 100%
> >> >    (though that does make the description of the semantics somewhat
> >> >    awkward).
> 
> >> Under memory shortage conditions, sum > 100% makes the page daemon
> >> abandon the page-access history due to the page-queue-reordering effect.
> >> That's one of the things that I'd like to avoid.
> 
> > like I said earlier, the usage-balancing code will reorder the queues
> > regardless of what the tunables are set to.  I don't see how it's possible
> > to enforce any limits based on usage type without reordering the queues.
> > (it may turn out that if we retain additional access history ala freebsd,
> > then we don't need the usage-type stuff at all, but that seems doubtful.)
> 
> According to Simon's preliminary result with yamt's patch, it seems
> we actually don't need the usage-type stuff by default.

... for the pc532-machdep.o benchmark.  as you'll see below, this is not
at all true for other benchmarks.


> That doesn't mean we never need the usage-type stuff, though.
> For example, Thor set vm.{anon,file}{min,max}={{10,40},{30,70}} on
> ftp.netbsd.org to prevent supfilesrv and rsyncd from flushing the file
> cache.  This sort of tuning can only be done by a human who knows the
> exact long-term workload, so the usage-type stuff is still useful.

agreed.


> >> >  - I don't know why file{min,max} would want to have any specific
> >> >    relation to exec{min,max}.
> >> 
> >> It's because the primary reason for the existence of those VM
> >> parameters is to prevent the famous UBC effect, i.e. file pages
> >> kicking anonymous and executable pages out of physical memory.
> >> So we nearly always have to give executable (and anonymous) pages
> >> priority over file pages.
> 
> > yes, but merely setting execmin (or in your scheme, execmax) to be non-zero
> > guarantees a certain amount of memory for executable pages, regardless of
> > what the tunables for file pages are set to.  so why would it be necessary
> > that the amount of memory guaranteed to be available for exec pages be
> > greater than the amount of memory guaranteed to be available for file pages?
> 
> OK, my description that vm.exec{min,max} must be greater than
> vm.file{min,max} might be wrong.
> The real reason is that even vm.file{min,max}={0,0} often gives too
> much physical memory to file pages.

why do you say that too much physical memory is given to file pages?
are you saying this in the context of the pc532-machdep.o test, or in general?


> >> >  - I would think that the default filemin should be at least 5%, 
> >> >    since that was the default minimum size of the buffer cache when
> >> >    we used it for file data.
> >> 
> >> I don't think so, because usually vm.file{min,max}={0,1} doesn't make
> >> file pages smaller than 5%.
> >> The following is what I saw on the 128MB-RAM-machine with
> >> vm.{anon,exec,file}{min,max}={{10,80},{0,2},{0,1}}:
> >> 
> >> anon exec file
> >> 31%  16%  50%
> >> 28%  15%  50%
> >> 32%  16%  52%
> >> 61%  15%  32%
> >> 74%  14%  20%
> >> 32%   3%  74%
> >> 35%   4%  70%
> >> 77%  15%  15%
> >> 
> >> It seems file pages are rather active even with the above parameters.
> 
> > with whatever workload you were running when you collected those numbers,
> > those settings didn't cause a problem.  my point is that there will be
> > other common workloads where those settings will cause a problem.
> 
> I've been using the setting for more than 6 months on machines which
> have enough RAM for anonymous and executable memory.
> And as far as I can see, the setting doesn't cause any problems,
> except that sometimes (not always) there end up being too many free pages.

absence of evidence is not evidence of absence.

the thing I'm concerned about with your settings is that
file and exec pages compete much more than they do currently.
2% of memory for exec pages is a lot less than before (on a 128MB
machine that's only about 2.5MB guaranteed for exec pages, versus
about 6.4MB with the old 5% minimum), and file activity can steal
any exec pages in excess of that.

separately, what are you seeing that leads you to say there end up
being too many free pages?


> >> BTW, have you compared your proposal of the new default:
> >> vm.{anon,exec,file}{min,max}={{80,90},{5,30},{5,20}}
> >> with your better sysctl settings for the MySQL benchmark?:
> >> vm.{anon,exec,file}{min,max}={{80,99},{5,30},{1,20}}
> >> 
> >> Also, is it possible to measure my proposal against it?
> >> vm.{anon,exec,file}{min,max}={{10,80},{0,2},{0,1}}
> 
> > I haven't had a chance to do any more runs yet, and I won't get more time
> > until next weekend.  but I'll try them then.
> 
> Thanks.
> Please test yamt's patch with (at least) vm.{anon,exec,file}{min,max}
> ={{0,0},{0,0},{0,0}}, too.

here are the mysql results for the various settings above.
all of these were with a 16 KB fs block size.  I only give the output
from "time" since sysbench was printing "0 transactions per second"
in some cases.


default =	{{10,80},{5,30},{10,50}}
sysctl-mysql =	{{80,99},{5,30},{1,20}}
sysctl-chs =	{{80,90},{5,30},{5,20}}
sysctl-soda =	{{0,80},{0,2},{0,1}}	(equivalent to what you suggested)
sysctl-zero =	{{0,0},{0,0},{0,0}}
fbsd-aging =	yamt's patch
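
for completeness, these names map directly onto the vm.{anon,exec,file}{min,max}
sysctls.  here's a quick, untested sketch of applying the sysctl-soda values
programmatically via sysctlbyname(3); "sysctl -w vm.anonmax=80" and so on from
the shell does the same thing:

/* apply {{0,80},{0,2},{0,1}}; run as root, minimal error handling */
#include <sys/param.h>
#include <sys/sysctl.h>
#include <err.h>
#include <stdio.h>

int
main(void)
{
        static const struct {
                const char *name;
                int value;              /* percent of managed memory */
        } knobs[] = {
                { "vm.anonmin", 0 }, { "vm.anonmax", 80 },
                { "vm.execmin", 0 }, { "vm.execmax", 2 },
                { "vm.filemin", 0 }, { "vm.filemax", 1 },
        };

        for (size_t i = 0; i < sizeof(knobs) / sizeof(knobs[0]); i++) {
                if (sysctlbyname(knobs[i].name, NULL, NULL,
                    &knobs[i].value, sizeof(knobs[i].value)) == -1)
                        warn("sysctlbyname %s", knobs[i].name);
                else
                        printf("%s = %d\n", knobs[i].name, knobs[i].value);
        }
        return 0;
}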


vanilla -current
76.612u 81.170s 34:49.40 7.5%   0+0k 2+3io 54pf+20w

 +sysctl-mysql
76.021u 75.275s 16:53.99 14.9%  0+0k 0+4io 0pf+0w

 +sysctl-chs
61.908u 74.107s 16:50.43 13.4%  0+0k 0+5io 33pf+0w
63.176u 72.353s 16:44.01 13.4%  0+0k 0+3io 33pf+0w

 +sysctl-soda
69.361u 74.506s 16:49.34 14.2%  0+0k 0+3io 22pf+0w
68.051u 72.047s 16:49.34 13.8%  0+0k 0+2io 33pf+0w
66.528u 73.305s 16:50.78 13.8%  0+0k 0+3io 22pf+0w
65.693u 72.425s 16:59.46 13.5%  0+0k 0+1io 33pf+0w
63.669u 73.597s 16:43.80 13.6%  0+0k 0+0io 33pf+0w

 +sysctl-zero
67.580u 79.990s 36:38.08 6.7%   0+0k 2+1io 225pf+24w

 +fbsd-aging
66.811u 80.394s 33:10.02 7.3%   0+0k 2+4io 66pf+26w

 +fbsd-aging +sysctl-zero
78.246u 79.779s 35:55.07 7.3%   0+0k 2+4io 193pf+23w



the sysctl-{mysql,chs,soda} cases are within the margin of error of
each other; they all work equally well.  fbsd-aging does better than
the current mach-based aging scheme, but only by a very small margin.
so either the patch doesn't implement freebsd's aging scheme correctly
or there's something else going on in freebsd in addition to this.
the latter seems more likely to me.


so I think the original point of this thread was to see if there is some
less-bad set of default tunable settings that we could use for the 3.0
release.  I guess I'd be ok with {{0,80},{0,5},{0,1}}, since that keeps
the minimum for exec pages where it was in 2.x but is otherwise what you
were suggesting (and should behave similarly to what I suggested as well).
how does that do in the pc532-machdep.o benchmark?

(I'll also see about implementing the other idea I had above for preserving
paging queue order while rebalancing.)

-Chuck