Subject: Re: vm.bufmem_hiwater not honored (Re: failing to keep a process
To: Thor Lancelot Simon <tls@rek.tjls.com>
From: Arto Selonen <arto@selonen.org>
List: tech-kern
Date: 11/16/2004 11:00:08
Hi!

On Mon, 15 Nov 2004, Thor Lancelot Simon wrote:

> On Mon, Nov 15, 2004 at 10:11:31PM +0200, Arto Selonen wrote:

> > Well, page daemon is asking for something back, with the buf_drain() call
> > after page scanning etc. However, in my case bufmem was >17,000 *pages*

> I happen to believe that freetarg should be considerably higher on modern
> large-memory systems.  Others may disagree; we seem to discuss it here
> from time to time but come to no really good conclusion.

Mmm. Another thing I forgot to mention about the buf_drain call in the page
daemon. When the call was before page scanning, doing
"bufcnt=freetarg-free;buf_drain(bufcnt);" made sense, as in
"we would like to guarantee at least freetarg, if the buffer cache is still
above the lowater mark". Now that the buf_drain call is after the scan, it
seems to be there merely to "exercise" the buffer cache and get some memory
back. Unless it is really useful to get different-sized drain requests, the
whole bufcnt could be dropped, no? Whatever is released by buf_drain at
that point is just memory to be used (by whoever needs it later), so why
make precise calculations about the amount to be freed from the buffer cache?
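
For concreteness, here is the pattern I mean, written out (just my sketch
from memory, not a copy of uvm_pdaemon.c, and the unit buf_drain()
expects may well differ):

    int bufcnt;

    /* shortfall computed before the scan */
    bufcnt = uvmexp.freetarg - uvmexp.free;
    if (bufcnt < 0)
            bufcnt = 0;

    /* ... page scanning now happens here ... */

    /*
     * By the time buf_drain() runs, the pre-scan shortfall no longer
     * describes anything in particular; any rough amount would do the
     * same job of pushing the buffer cache back down.
     */
    buf_drain(bufcnt);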

To use a low estimate of "freetarg-free", one could use freemin/2, which
amounts to "we would like roughly 10% of the freed memory to come from the
buffer cache, as everybody needs to give back some, and the buffer cache is
using ~15% of it all"; or use freemin as a high estimate: "we would like
roughly 20% ...". Of course, if BUFCACHE differs much from the default 15%,
this would introduce some bias.
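
In code, the two estimates would simply be (again only a sketch; I am
assuming the counter is uvmexp.freemin):

    buf_drain(uvmexp.freemin / 2);  /* low estimate, roughly 10% from bufcache */
    buf_drain(uvmexp.freemin);      /* high estimate, roughly 20% */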

Yes, I understand nobody is going to touch the code over such a minor
thing (especially since I could be way off here), but it is an observation
I thought worth mentioning.

> I am curious as to _how_ bufpages got to be so high.  Do you have,
> perhaps, a huge number of directories on a very-large-block filesystem?

Not knowing what you might consider large/huge, I'll let you be the judge.
It is probably squid's disk cache, which resides on a partition of its
own. The FFS partition is mounted rw,softdep,noatime (I mention this as
softdeps may be at least partially involved; just a guess, though).
The partition is roughly 16 GB in size, although squid is only using
a bit less than half of that. The block size for the partition is 8 kB
(default, I think).

The cache directory hierarchy has two levels: the first level has 32
directories (00..1F), and each of those has 256 subdirectories
(00..FF). The disk cache holds roughly 400,000 objects, with about 100,000
requests handled per day by squid (representing about 1 GB of data).
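(That is 32 * 256 = 8192 leaf directories, so with ~400,000 objects there
are on the order of 50 cache files per leaf directory.)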

> The buffer cache growth algorithm is _extremely_ conservative.  Once it
> gets to bufmem_hiwater, it should _always_ recycle an existing buffer
> rather than allocating a new one.  The algorithm is from an old lecture

Yes, I saw the commit message and the code, if you mean the code in
buf_lotsfree. In that case, it should only allow the hiwater mark to be
exceeded once, as there is a check before the probabilistic one (and the
probabilistic check would not allow another attempt either). But what if
that one allowed attempt is large enough? Or what about a resize of an
existing buffer in allocbuf()?
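
To spell out how I read it (this is only my paraphrase from memory, not the
actual buf_lotsfree(); the 16-step granularity and the exact probability
calculation are my guesses):

    static int
    buf_lotsfree(void)
    {

            /* Below the lowater mark: always allow growing the cache. */
            if (bufmem < bufmem_lowater)
                    return 1;

            /* Above the hiwater mark: never allow further growth. */
            if (bufmem > bufmem_hiwater)
                    return 0;

            /*
             * In between: allow growth with a probability that drops
             * as bufmem approaches bufmem_hiwater.
             */
            return ((random() & 15) > bufmem / (bufmem_hiwater / 16));
    }

The checks happen before the allocation itself, so nothing here bounds how
large that one allowed allocation (or an allocbuf() resize) turns out to be.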

> size, I don't understand how your system got into the situation it is
> in, at all.
>
> And I would very much like to.

Thank you for sharing the interest. :) Any help is gladly accepted.

I just took a look at the current situation, and this is what I'm seeing:

% sysctl vm | grep buf
vm.bufcache = 5
vm.bufmem = 46371840
vm.bufmem_lowater = 4194304
vm.bufmem_hiwater = 33554432

As you can see, I have tuned both the hi and lo water marks since my initial
posting, and bufmem was below hiwater after that change. So, within the last
12-20 hours (I was not monitoring it or collecting statistics yet), it has
again exceeded the upper limit by some 3,000 pages.
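(46371840 - 33554432 = 12817408 bytes, which is roughly 3,100 pages if
these are 4 kB pages.)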

I am open to any suggestions as to what to monitor/collect/try.
One of the likely times for the problem to appear is right after
midnight, when logs are rotated, which means that squid forks.
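For instance, I could log vm.bufmem (and the other vm.buf* values shown
above) once a minute from cron around midnight, to see whether the jump
coincides with the log rotation and the squid fork.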

I haven't touched the system since then, as I wanted to see if there is any
further change in vm.bufmem. I am willing to try pretty much anything
reasonable, keeping in mind that the system is in production use, so
I don't want to disrupt its services too much or too often.


Artsi
-- 
#######======------  http://www.selonen.org/arto/  --------========########
Everstinkuja 5 B 35                               Don't mind doing it.
FIN-02600 Espoo        arto@selonen.org         Don't mind not doing it.
Finland              tel +358 50 560 4826     Don't know anything about it.