Subject: vm.bufmem_hiwater not honored (Re: failing to keep a process from
To: None <tech-kern@netbsd.org>
From: Arto Selonen <arto@selonen.org>
List: tech-kern
Date: 11/15/2004 14:04:02
Hi!

While trying to test the patch in PR 26908, I stumbled upon the following
(I was still running stock -current, without the patch, as a baseline):

Swap was in use (again), and all limits seemed to have reasonable values.
Looking at the memory usage, it seemed that the kernel (UBC?) was using a
lot of memory and not giving it up, so parts of squid were being paged out.

Looking at `sysctl vm | grep buf` gave me this:

vm.bufcache = 5
vm.bufmem = 124295168
vm.bufmem_lowater = 6704640
vm.bufmem_hiwater = 53637120

Running `sysctl -w vm.bufcache=5` changed the above values to these:

vm.bufcache = 5
vm.bufmem = 53642240
vm.bufmem_lowater = 6704640
vm.bufmem_hiwater = 53637120

Eventually, bufmem seemed to drop below hiwater again.

Now, a couple of questions come to mind:

1) Why did bufmem go above the hiwater mark in the first place?
2) Why didn't it drop below hiwater under a low-memory condition?


Question 1:
-----------
Looking at the code, I found only two places where bufmem growth is
checked against bufmem_hiwater (maybe I did not look hard enough?):

	- in src/sys/kern/vfs_bio.c, buf_lotsfree()
	- in src/sys/kern/vfs_bio.c, allocbuf()

Since buf_lotsfree() looks fairly simple, I'm assuming that the extra
growth is allowed by allocbuf(), as it merely *tries* to trim the bufmem
usage when the hiwater mark is reached. So, presumably, it can fail to
enforce the hiwater mark? It looks like buf_canrelease() can, by design,
return 0 if there is no low-memory condition (free pages above the free
target) and there is nothing on the AGE list (the part I don't quite
understand). So, an answer to question (1) might be:

"There is a chance that hiwater mark for bufmem is not honored. Let it
happen, as it is probably memory well used, and it should only happen
if there is memory to spare."

I'm not sure if this is a real problem or not. Most of the time there
will be memory to spare, as the page daemon tries to keep it that way. The
deciding factor then becomes whether there is anything on that AGE list
or not.
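
To make that concrete, here is a rough paraphrase of the logic as I read
it from vfs_bio.c (not the literal code; the pagedemand calculation is
my interpretation of the "free pages above target free" part):

	/* buf_canrelease(), roughly: how many bytes may be released */
	ninvalid   = bytes held by buffers on the AGE (invalid) queue;
	pagedemand = uvmexp.freetarg - uvmexp.free;
	if (pagedemand < 0)
		pagedemand = 0;	/* free pages above target: no shortage */

	/* With pagedemand == 0 and an empty AGE list this is 0, so
	 * allocbuf() trims nothing even when bufmem > bufmem_hiwater. */
	return MAX(ninvalid, MIN(2 * MAXBSIZE,
	    MIN((bufmem - bufmem_lowater) / 16, pagedemand * PAGE_SIZE)));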

As I've said earlier, I don't mind UBC eating a lot of memory, as long as
it can easily be reclaimed by others in need. Due to various other
issues, that does not seem to be the case (UBC memory usage reduces the
effective vm.{anon,exec,file}{min,max} limits; UBC memory is not reclaimed
before page scanning by the page daemon).


Question 2:
-----------
When the system is starting to run out of free memory, my simplistic view
is that (bufmem - bufmem_hiwater) should be one of the first things to go.
Furthermore, the way the vm.{anon,exec,file}{min,max} percentages are used
means that the system is not simply using more memory than it should; it
is also pushing other pages to swap sooner than it should.

The page daemon does call buf_drain(), but only after it has scanned the
page queues. This means that if bufmem is already above the hiwater mark,
that extra memory is not available on that invocation, and pages may
end up in swap "unnecessarily".

Originally, buf_drain() was called after scanning the queues, but in
revision 1.58 of src/sys/uvm/uvm_pdaemon.c it was moved to the top. In
revision 1.60 it was moved back, due to locking problems (PR 27057).

Locking issues aside, is there any reason why the page daemon should not
drop bufmem below the hiwater mark *before* starting to scan pages?
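
Schematically, the page daemon loop currently does something like this
(heavily simplified; `deficit' is just my shorthand for whatever amount
the daemon decides to ask for):

	for (;;) {
		/* sleep until woken because free memory is short */
		uvmpd_scan();		/* scan queues; may push pages to swap */
		buf_drain(deficit);	/* excess bufmem released only now */
	}

whereas what revision 1.58 did, and what I'm asking about, is:

	for (;;) {
		/* sleep until woken because free memory is short */
		buf_drain(deficit);	/* release bufmem > hiwater first */
		uvmpd_scan();		/* then scan; less needs paging out */
	}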


Problem MeToo:
--------------
Looking at PR 27057, I realize that it may be the same problem we have had
with squid for quite some time, as its disk cache is kept on a partition
mounted like this:

	/dev/wd0h   /squid   ffs   rw,softdep,noatime   1   2

Various incoherent explanations of our problems with squid memory usage
causing partial freezes can be found in PR 25761. The last time we saw
this was on October 23rd, running -current with sources from ~20041012
(and I still have a forced crash dump of that one available).

I was also hoping to reproduce the problem as part of the PR 26908 patch
test, but I'm not sure if bufmem > bufmem_hiwater was a required condition
(and I just got rid of that). System uptime is now 5 days, so reproducing
these issues can take days (up to a week or two).


Loosely related question:
-------------------------
In src/sys/kern/vfs_bio.c, the function buf_canrelease() contains the
following:

        return MAX(ninvalid, MIN(2 * MAXBSIZE,
            MIN((bufmem - bufmem_lowater) / 16, pagedemand * PAGE_SIZE)));

The description of the function says that it returns a number of bytes.
I've understood (incorrectly?) that bufmem (and bufmem_{lo,hi}water) are
also in bytes, so why is the difference divided by 16?
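
Plugging in the numbers from above: even under memory pressure (when
pagedemand * PAGE_SIZE is large) and with nothing on the AGE list, the
inner MIN() caps the release at

	2 * MAXBSIZE                 = 2 * 65536            =   131072
	(bufmem - bufmem_lowater)/16 = 117590528 / 16       =  7349408
	bufmem - bufmem_hiwater      = 124295168 - 53637120 = 70658048

(assuming MAXBSIZE is 64 kB here), i.e. at most ~128 kB per call, so
draining a ~70 MB excess would take hundreds of buf_drain() calls. If
the division by 16 is intentional damping, it still seems surprisingly
slow.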


Artsi
-- 
#######======------  http://www.selonen.org/arto/  --------========########
Everstinkuja 5 B 35                               Don't mind doing it.
FIN-02600 Espoo        arto@selonen.org         Don't mind not doing it.
Finland              tel +358 50 560 4826     Don't know anything about it.