Subject: pmap tweaking (was Re: Things to work on)
To: None <port-arm32@netbsd.org, port-arm@netbsd.org>
From: Chris Gilbert <chris@paradox.demon.co.uk>
List: port-arm
Date: 06/01/2001 00:40:15
Just to update people.

I've had a play with the pmap stuff.  I've managed to get my head around how 
most of it works and what the terminology is.  I've managed to halve the 
./lat_proc fork time (from lmbench) on a cats: down to 8000 microseconds 
from 16000 microseconds (for comparison, my PII 333 gets 1000 microseconds).  
Note that this is by no means an accurate or real-world test, just an 
indication that something is better (and note that it's with PMAP_DEBUG on, 
DIAGNOSTIC on, and a whole pile of other debug stuff on).  I'll retest with 
something more realistic at some point, e.g. timing make configure for gmake.

One reason for the above is that pmap_release currently scans the whole of 
the L1 table for entries.  By instead using a uvm_object, and allocating the 
L2 tables against it, you can simply walk the uvm_object's page list and 
free them off :)
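
To sketch the idea (pm_obj is just my shorthand here for a uvm_object 
embedded in the pmap; the real field names may well differ):

    #include <uvm/uvm.h>

    /*
     * Each L2 table page gets allocated against the pmap's object,
     * e.g.  pg = uvm_pagealloc(&pmap->pm_obj, offset, NULL, 0);
     * so pmap_release() only has to walk that object's page list
     * instead of scanning all 4096 L1 descriptors:
     */
    struct vm_page *pg;

    while ((pg = TAILQ_FIRST(&pmap->pm_obj.memq)) != NULL) {
            /* uvm_pagefree() also unlinks pg from pm_obj's memq */
            uvm_pagefree(pg);
    }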

However, I'm having some issues with it that I need to look into.

I'm also playing with some other tweaks to the code, but I really need to sit 
down again and write up some clear notes, e.g. on how to tell whether a page 
is wired, modified, referenced, etc.
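
For instance (the flag names below are invented for illustration, not the 
actual definitions), the sort of thing such notes would pin down is that 
wired, modified and referenced are software state kept per-mapping, since 
the ARM MMU doesn't maintain referenced/modified bits in hardware:

    /* Illustration only -- invented names, not the real arm32 macros. */
    #define PV_WIRED        0x01    /* mapping is wired down */
    #define PV_MODIFIED     0x02    /* page has been written to */
    #define PV_REFERENCED   0x04    /* page has been accessed */

    static __inline int
    pv_page_is_wired(struct pv_entry *pv)
    {
            return ((pv->pv_flags & PV_WIRED) != 0);
    }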

I've also implemented a pool for the pmap objects, so that we don't keep 
allocating and freeing them.
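
In pool(9) terms that's roughly the following (the pool_init() arguments are 
from memory and may not match the tree exactly):

    #include <sys/pool.h>

    static struct pool pmap_pmap_pool;

    /* once, at pmap bootstrap time */
    pool_init(&pmap_pmap_pool, sizeof(struct pmap), 0, 0, 0,
        "pmappl", 0, NULL, NULL, M_VMPMAP);

    /* pmap_create() then becomes little more than a pool_get()... */
    pmap = pool_get(&pmap_pmap_pool, PR_WAITOK);

    /* ...and the final pmap_destroy() a pool_put() */
    pool_put(&pmap_pmap_pool, pmap);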

I've also played with getting rid of the static L1 tables, but somehow they 
keep ending up fragmented after being freed and reused *sigh*.  That suggests 
some kind of VM leak somewhere.  Sadly, allocating 4 pages as a contiguous 
block will always be a problem.  I did consider putting the L1 tables in a 
pool, but that won't work as we need the pglist.  One thought I had (though 
I'm not sure it'll work) is to keep the l1pt structs in a pool, with the 
pglist still kept in the struct, but I've not had time to examine how the 
pool code works to tell whether this is feasible.
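
For reference, the allocation that makes this awkward looks something like 
the sketch below (L1_TABLE_SIZE is my name for the 16KB constant, and 
physical_start/physical_end stand in for the RAM bounds):

    #include <uvm/uvm.h>

    struct pglist plist;
    int error;

    /*
     * The L1 translation table is 4 pages (16KB) and must also be
     * 16KB-aligned, hence the alignment argument.  The pglist filled
     * in here is what we have to keep around to free the pages again
     * later -- which is exactly why a plain pool doesn't fit.
     */
    TAILQ_INIT(&plist);
    error = uvm_pglistalloc(L1_TABLE_SIZE, physical_start, physical_end,
        L1_TABLE_SIZE, 0, &plist, 1, 0 /* don't wait */);
    if (error != 0)
            return (error);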

Note that you can actually have more than 256 processes: I ran lat_ctx with 
300 processes and it worked.  The issue is finding that 16k of contiguous 
memory (I'm also wondering if there's a problem finding 16k of contiguous 
space in the kernel VM map too...).

So many things to think on/consider.

If anyone is interested in looking at the stuff I've done so far, let me know 
and I'll mail the current diff.

Cheers,
Chris