Subject: Re: memory tester shows up swap/page tuning bug [was Re: BUFFERCACHE,
To: John S. Dyson <toor@dyson.iquest.net>
From: Jonathan Stone <jonathan@DSG.Stanford.EDU>
List: tech-kern
Date: 09/15/1996 20:38:44
Warning: half-baked opinions follow; it's been a long weekend and
I might say some egregiously wrong things below. Apologies in
advance if I do.
>> I'm also not convinced a complete fix need be so complicated.
>The problem is that the variable that I changed the wakeup to
>is not the right one. I mean, that the whole thing smells of
>hack... Waking up on the pages needed thing is wrong, but works.
Yup. I did a lot of the implementation work for the VM system in the
late incarnations of the Stanford V kernel. Some of that work was
reported in a paper by Cheriton and Harty in an ASPLOS proceedings
from about four years ago, on a memory-market model for large apps
(like, for a machine with over 400Mbytes of real memory; not bad for
1991).
The way I'd restructure this part of the VM system is to think
explicitly in terms of *rates* of page events. The long-term rate at
which pages are put on the freelist needs to be at, or slightly above,
the rate at which pagefaults are occurring. (If not, then the system
will run out of free pages, and faulting processes are delayed until
an active page can be written back. Ugh.)
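To make that concrete, here's a toy sketch in C of the bookkeeping I
have in mind. None of these names exist in any BSD tree; they're
purely illustrative:

    /*
     * Illustrative only: pick a freeing target for the next pageout
     * pass so that the long-term freeing rate tracks the fault rate.
     */
    static int
    pageout_target(int faults_this_interval, int freed_this_interval,
        int free_pages, int free_target)
    {
            int target;

            /* At minimum, free about as fast as we have been faulting. */
            target = faults_this_interval - freed_this_interval;

            /* Also make up any deficit below the desired free-page pool. */
            if (free_pages < free_target)
                    target += free_target - free_pages;

            return (target > 0 ? target : 0);
    }
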
In Mach-like VM systems, new free pages are produced by being cleaned
off the inactive list. (Let's ignore unbinding from VM objects for
now.) Pages in use by processes are marked either as "active" or
"inactive". In the VM system I helped implement, putting a page on
the ``inactive'' list caused sampling of references to that page.
(This involved a software reference-bit emulation on the hardware we
were using at the time.) Pages on the "inactive" list that were not
actively touched by a process got moved to a candidate replacement
list, where they were cleaned and then put on the free list. (Well,
actually, it
was a "second chance" list; a process could fault a page and reclaim
it off the freelist until the frame was actually reallocated. This
seemed to be a win for some of the apps we ran.)
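In other words, the lifecycle looked roughly like the sketch below
(made-up types and names; the real code keeps pages on queues of
vm_page structures rather than tagging them with an explicit state):

    enum page_state { PG_ACTIVE, PG_INACTIVE, PG_FREE };

    struct page {
            enum page_state state;
            int             dirty;          /* modified since last writeback */
            int             frame_reused;   /* physical frame reallocated */
    };

    void start_async_writeback(struct page *);      /* queue for cleaning */

    /* Sampling pass found an inactive page untouched: clean and free it. */
    void
    page_untouched(struct page *pg)
    {
            if (pg->dirty)
                    start_async_writeback(pg);
            pg->state = PG_FREE;            /* onto the "second chance" list */
    }

    /* A fault on a page still on the free list reclaims it without I/O. */
    int
    page_reclaim(struct page *pg)
    {
            if (pg->state == PG_FREE && !pg->frame_reused) {
                    pg->state = PG_ACTIVE;
                    return (1);
            }
            return (0);             /* frame gone; fault from backing store */
    }
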
My point here is that, as the pagefault rate increases, the *rate* at
which pages are sampled for activity needs to increase, so that the
(necessarily lower) rate at which pages are *freed* can also increase
correspondingly.
In Mach/4.4-Lite terms, either the rate at which pages are moved onto
the inactive list, or the time a page stays there before being cleaned
and reclaimed, needs to increase. That's so the rate at which pages
are *freed* (not just cleaned, but also put back on the freelist)
can also increase. IIRC, one really does need to speed up the
implicit "clock hands", or pages that're accessed regularly but with
long intervals between them stay in memory, which isn't what one wants
in a high-fault-rate environment.
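Roughly, I mean something like this (again, invented names, just to
show the shape of the calculation):

    /*
     * Illustrative only: scale the number of pages the "clock hands"
     * examine per pass with the recent fault rate, so the deactivation
     * (and hence freeing) rate can keep up under load.
     */
    static int
    inactive_scan_target(int recent_faults, int base_scan, int max_scan)
    {
            int scan;

            /* Examine at least the base amount, plus more as faults rise. */
            scan = base_scan + recent_faults;

            /* No point scanning more pages than the queues hold. */
            if (scan > max_scan)
                    scan = max_scan;
            return (scan);
    }
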
NetBSD (and 4.4-Lite and Lite2) makes sure a minimum number of pages
is moved to the inactive list, but takes no steps to ensure that a
reasonable number of pages are actually *freed*.
The regime I was seeing *without* John's patch is that the inactive list
is huge, about 16Mbytes, but that when the freelist gets exhausted
(i.e., there are absolutely no free pages), the system freezes whilst
the background-rate page cleaning cleans out all the dirty pages in
the active list. With the patch, there are now sufficient free pages
for the memory-hog to keep running, but only getting new pages at
approximately the rate at which the "frozen" system was cleaning
pages.
If I understand John's patch, it works because it re-invokes the
pageout daemon any time a low free-pages condition is signalled (by
a wakeup on the relevant variable, e.g., from swap_pages_iodone(),
after a pageout to swap backing store completes.) This happens
"enough" to stop the system ever running completely out of free pages,
thus avoiding the freeze behaviour.
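The shape of it, as I read it, is roughly this (sketched with the
4.4BSD tsleep()/wakeup() idiom; the helper names are mine, not the
actual code):

    int vm_pages_needed;            /* sleep/wakeup channel for the daemon */

    void scan_and_clean_pages(void);        /* one pageout pass (sketch) */

    /* Pageout daemon main loop: sleep until someone signals a shortage. */
    void
    pageout_daemon(void)
    {
            for (;;) {
                    tsleep(&vm_pages_needed, PVM, "psleep", 0);
                    scan_and_clean_pages();
            }
    }

    /*
     * Called from places like the pageout-completion path: if free pages
     * are still scarce, kick the daemon again rather than waiting for
     * the next periodic scan.
     */
    void
    maybe_wake_pageout(int free_pages, int free_min)
    {
            if (free_pages < free_min)
                    wakeup(&vm_pages_needed);
    }
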
But (again, IIRC) once we get into the "problem" regime for NetBSD's
VM system, it's essentially serializing page cleaning activity, by
starting another pageout pass whenever the previous one completes. I think a
better thing to do is to ensure *some* kind of lower bound on the rate
at which pages are cleaned -- especially since we have all this asynch
writeback support.
It looks like it *should* be relatively simple to add another loop--
like the existing loop which forces at least "page_shortage" pages to
be moved from the active to the inactive list, per scan -- which
enforces a minimum rate at which pages on the inactive list are forced
clean and then moved to the free list. (The maximum rate will, of course,
be determined by the writeback rate to backing store. It would be
nice to not waste CPU cycles trying to write pages back *faster* than
that. A bit like TCP RTT and send-win estimation, perhaps :))
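Something along these lines, say (strictly a sketch, with an invented
struct and helpers; it would really sit next to the existing
page_shortage loop in the pageout scan):

    /*
     * Strictly a sketch: force at least min_clean dirty pages on the
     * inactive list into asynchronous writeback per scan, putting a
     * floor under the rate at which pages become freeable.
     */
    struct page {
            struct page     *next;          /* inactive queue linkage */
            int             dirty;
    };

    struct page *inactive_queue_head;               /* oldest pages first */
    void start_async_writeback(struct page *);      /* non-blocking */

    void
    force_minimum_cleaning(int min_clean)
    {
            struct page *pg;
            int launched = 0;

            for (pg = inactive_queue_head;
                pg != NULL && launched < min_clean; pg = pg->next) {
                    if (!pg->dirty)
                            continue;       /* already clean; freeable as-is */
                    start_async_writeback(pg);
                    launched++;
            }
            /*
             * The useful ceiling is the writeback bandwidth of the
             * backing store; queueing pageouts faster than the device
             * can retire them just wastes CPU.
             */
    }
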
I honestly can't tell if that's the kind of "performance win" John is
talking about, or not. John, what do you think? Is this anything
like what you meant, or like what FreeBSD does?
It might be nice if whatever the NetBSD VM system evolves to could
at least share expertise with FreeBSD. As the two VM systems stand now
(as with autoconfig and bus support in the other direction), it seems
like John and I are barely even talking the same language...