Subject: Re: vm.bufmem_hiwater not honored (Re: failing to keep a process
To: Daniel Carosone <dan@geek.com.au>
From: Arto Selonen <arto@selonen.org>
List: tech-kern
Date: 11/17/2004 11:23:11
Hi!

On Wed, 17 Nov 2004, Daniel Carosone wrote:

> Please try the attached diff, which I have been using since we last
> looked at all this issue earlier this year.

> I would be very interested to see how it affects your issue.

I am almost certain that it will improve my situation. However, I don't
think it is a *solution* for it.

As Thor Lancelot Simon said, "buffer cache growth algorithm is _extremely_
conservative". I have now run 'systat bufcache -w 1' for about one full
day, and have made the following observations:

	1) bufmem seems to jump way over bufmem_hiwater once a day
	   (I suspect midnight/logroll/squid, and will try to confirm ASAP)

	2) during the day, there is an obvious trend for bufmem to shrink
		- every time page daemon scans, the number of metadata
		  buffers decreases, and so bufmem shrinks too
		- this is about 2MB (~500 pages) per hour for bufmem
		  and a bit less than 1000 buffers per hour
		- over a period of 10 hours, the numbers dropped:
			- bufmem: 42MB -> 21MB
			- buffers: 12,000 -> 3,500

	3) within the shrinking trend, bufmem fluctuates: it both *grows*
	   and shrinks in size, though the number of buffers remains the
	   same (this is while free>freetarg); this happens even when
	   bufmem>bufmem_hiwater

This morning, buffer cache usage was again:

	vm.bufmem = 45648896
	vm.bufmem_lowater = 4194304
	vm.bufmem_hiwater = 33554432

I saw vm.bufmem at over 60,000,000, but it somehow managed to drop
to that 45M while I was writing this. I agree with Thor that bufmem
should NOT be able to get so far above the hiwater mark (or at least
that is how I interpreted his comments). I consider that to be the main
problem in my case.
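
To put a number on "so far above": with the sysctl values quoted above,
bufmem is 12,094,464 bytes (about 11.5MB, or 36%) over bufmem_hiwater.
A throwaway user-space helper for redoing that arithmetic with other
readings might look like this; the function names are made up for
illustration and are not anything in the kernel.

```c
#include <assert.h>

/* Bytes by which bufmem exceeds hiwater (negative if it is below). */
static long long
over_hiwater(long long bufmem, long long hiwater)
{
	return bufmem - hiwater;
}

/* The same excess as a whole percentage of hiwater, rounded toward zero. */
static long long
over_hiwater_pct(long long bufmem, long long hiwater)
{
	assert(hiwater > 0);
	return (bufmem - hiwater) * 100 / hiwater;
}
```

Feeding it the earlier 60,000,000 reading gives roughly 78% over the
hiwater mark.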

Your patch will probably help because it changes the allocbuf
behaviour when a buffer is resized (which I'm assuming causes the
fluctuations in bufmem usage, and which happens a lot). These resizes
take place when free>freetarg, and without your patch buf_canrelease
returns 0 there (if the AGE list is empty), so allocbuf does not react
to bufmem already being over the hiwater mark. With the patch,
buf_canrelease will almost always return non-zero, so buffer usage will
be trimmed through buf_trim(), and every resize while
bufmem>bufmem_hiwater could indeed reduce buffer cache size. That would
shrink it a lot faster than relying only on the page daemon to reduce
it a bit at a time.

Anyway, once I get some confidence in understanding why/how bufmem is
currently behaving, then I'll try your buf_canrelease patch.


  ---- IF YOU ARE BUSY, STOP HERE; "EXTREME" PROGRAMMING FOLLOWS -----


As for the patch itself, here is my reasoning for buf_canrelease():
(take it as both proof of code, and an explanation as to how I see this)

	- The comment for the function says
	  "Return estimate of bytes we think need to be
	   released to help resolve low memory conditions."

	  I disagree partly, as there may not be any low memory condition,
	  but there may still be a need to release some buffer cache
	  bytes. Of course this all depends on why the function
	  exists, and as I don't know the real reason, I've made up
	  my own: "Return the number of bytes one could ask buffer cache
	  to release, if there was a need to reduce buffer cache size".

		NOTE: this changes the current semantics
		NOTE: this leaves the size decision to caller

	  Currently, buf_canrelease seems to be used only by allocbuf()
	  when a resize would lead bufmem to be over bufmem_hiwater.
	  That call is "unconditional" in the sense that it leaves
	  the size decision to buf_canrelease: I think it should only
	  want to reduce buffer cache size as long as bufmem>bufmem_hiwater

	  buf_canrelease can not know *why* the caller would like to
	  reduce the buffer cache size (and there are at least two
	  different cases: resize exceeding hiwater, and page daemon
	  asking buffer cache to participate in freeing memory), thus
	  it can not make that decision. It can only give a suggestion as
	  to how much *it* would like buffer cache size to be reduced, if
	  somebody wanted to do that.

	- Now that I've defined the reason for buf_canrelease to exist,
	  I can define how I think it might function. It should fulfil
	  the following conditions:

		- take as an argument 'request' (bytes caller wants)
		- never say bufmem could go below lowater mark
		- always offer enough to go below hiwater mark
		- never offer more than was requested
		  (requesting 0 means no preference)
		- always offer at least as much as AGE list has
		- always offer at least freemin (could be freetarg)
		- offer to shrink 1/16 of current usage
		- try not to offer more than two NMEMPOOLS worth

	  The above would lead to something like this:
	  (actual implementation is left to reader)

		if (request <= 0)
			request = bufmem - bufmem_lowater;
		return MAX(0,
			MIN(bufmem - bufmem_lowater,
			 MAX(bufmem - bufmem_hiwater,
			  MIN(request,
			   MAX(bufqueues[BQ_AGE].bq_bytes,
			    MAX(freemin * PAGE_SIZE,
			     MIN((bufmem - bufmem_lowater) / 16, 2 * MAXBSIZE)))))));

	  I don't know about efficiency, but it looks fairly clean
	  and simple. It does not need to be very exact either, as it
	  is only a suggestion (and the caller will need to make the
	  final decision anyway).

		NOTE: it will almost never return 0
		NOTE: it will suggest more than request if over hiwater

	- Currently, the only user is allocbuf, which could be modified
	  for this approach quite easily (reusing variables, copying
	  buf_drain; I guess one could do two versions of buf_drain:
	  one with locking and another one without, so either one could be
	  called depending on whether locks were already set or not):

		if ((bufmem += delta) > bufmem_hiwater) {
			int target, got;
			target = buf_canrelease(bufmem-bufmem_hiwater);
			got = 0;
			while (got < target) {
				delta = buf_trim();
				if (delta == 0)
					break;
				got += delta;
			}
		}

	- With the above, one could take advantage of buf_canrelease
	  also in page daemon (again, efficiency might be a concern):

		buf_drain (buf_canrelease(bufcnt));
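
For anyone who wants to play with the numbers outside the kernel, here
is a rough user-space sketch of the two pieces above: the nested MIN/MAX
expression unrolled step by step, and the allocbuf() trimming loop. It
is only a toy model under stated assumptions: struct toycache and the
toy_* functions are hypothetical stand-ins (every buffer has the same
size, there is no locking), and freemin is passed in already converted
to bytes. It is not the actual vfs_bio.c code, nor the patch under
discussion.

```c
#include <assert.h>

#define MAX(a, b) ((a) > (b) ? (a) : (b))
#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* Hypothetical user-space stand-in for the kernel state involved. */
struct toycache {
	long bufmem;	/* bytes currently held by the buffer cache */
	long lowater;	/* vm.bufmem_lowater */
	long hiwater;	/* vm.bufmem_hiwater */
	long age_bytes;	/* bufqueues[BQ_AGE].bq_bytes */
	long freemin;	/* in bytes, i.e. freemin * PAGE_SIZE */
	long maxbsize;	/* MAXBSIZE */
	long bufsize;	/* toy model: every buffer has this size */
	long nbuf;	/* toy model: buffers that can still be trimmed */
};

/* The proposed expression, evaluated from the innermost term outwards. */
static long
toy_canrelease(const struct toycache *c, long request)
{
	long room = c->bufmem - c->lowater;
	long hint;

	assert(c->lowater <= c->hiwater);
	if (request <= 0)
		request = room;		/* 0 means "no preference" */
	hint = MIN(room / 16, 2 * c->maxbsize);
	hint = MAX(c->freemin, hint);
	hint = MAX(c->age_bytes, hint);
	hint = MIN(request, hint);
	hint = MAX(c->bufmem - c->hiwater, hint); /* enough to reach hiwater */
	hint = MIN(room, hint);			  /* never below lowater */
	return MAX(0, hint);
}

/* Stand-in for buf_trim(): free one buffer, return bytes freed or 0. */
static long
toy_trim(struct toycache *c)
{
	if (c->nbuf == 0)
		return 0;
	c->nbuf--;
	c->bufmem -= c->bufsize;
	return c->bufsize;
}

/* The allocbuf() fragment: grow by delta, then trim if over hiwater. */
static long
toy_resize(struct toycache *c, long delta)
{
	long got = 0, freed;

	c->bufmem += delta;
	if (c->bufmem > c->hiwater) {
		long target = toy_canrelease(c, c->bufmem - c->hiwater);

		while (got < target) {
			freed = toy_trim(c);
			if (freed == 0)
				break;	/* nothing left to release */
			got += freed;
		}
	}
	return got;
}
```

With this morning's sysctl values plugged in (and an assumed freemin of
64 4KB pages and a MAXBSIZE of 64KB), toy_canrelease() suggests
releasing the full 12,094,464 bytes of overshoot whatever the caller
requested, which matches the second NOTE under the expression.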

This may well break all sorts of conditions that I'm not aware
of. There may be timing issues, and what not. I may be messing
with a critical path, where you really don't want this sort of thing.


Artsi
-- 
#######======------  http://www.selonen.org/arto/  --------========########
Everstinkuja 5 B 35                               Don't mind doing it.
FIN-02600 Espoo        arto@selonen.org         Don't mind not doing it.
Finland              tel +358 50 560 4826     Don't know anything about it.