Subject: Re: pmap tweaking (was Re: Things to work on)
To: Chris Gilbert <chris@paradox.demon.co.uk>
From: Jason R Thorpe <thorpej@zembu.com>
List: port-arm
Date: 05/31/2001 17:09:03
On Fri, Jun 01, 2001 at 12:40:15AM +0100, Chris Gilbert wrote:

 > I've had a play with the pmap stuff.  I've managed to get my head around how 
 > most of it works, and what the terminology is.  I've managed to get the 
 > ./lat_proc fork (from lmbench) time down by half on a cats: from 16000 
 > microseconds to 8000 microseconds (my PII 333 gets 1000 microseconds).  
 > Note that this is by no means accurate or a real world test, just a show that 
 > something is better (note that this is with a PMAP_DEBUG on, DIAGNOSTICS on, 
 > and a whole pile of other debug stuff on).  I'll retest with something more 
 > realistic at some point, eg time make configure for gmake.

Sweet.

 > One reason for the above is that pmap_release currently scans the whole of 
 > the L1 table for entries, however by using a uvm_object and allocating the L2 
 > tables and associating them with the uvm object, you can walk the 
 > uvm_object's list and free them off :)

Actually, when a pmap is destroyed, you can assume there are no mappings
in it at all.  Feel free to add some sort of assertion to this effect.  This
is documented in pmap(9) (have you read through that document?  If so, please
feel free to ask me questions and point out places where it can be clarified.)
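
A minimal sketch of what such an assertion could look like, written as a
user-space toy rather than the real arm32 pmap. The struct, its fields, and
the toy_pmap_* names are all hypothetical; in the kernel the check would
presumably be a KASSERT or live under DIAGNOSTIC instead of plain assert().

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a pmap; not the real NetBSD structure. */
struct toy_pmap {
    int    refcount;        /* references held on this pmap */
    size_t resident_count;  /* valid mappings currently entered */
};

struct toy_pmap *toy_pmap_create(void)
{
    struct toy_pmap *pm = calloc(1, sizeof(*pm));

    if (pm != NULL)
        pm->refcount = 1;
    return pm;
}

/* Bookkeeping only: a real pmap would also write and clear the PTEs. */
void toy_pmap_enter(struct toy_pmap *pm)
{
    pm->resident_count++;
}

void toy_pmap_remove(struct toy_pmap *pm)
{
    assert(pm->resident_count > 0);
    pm->resident_count--;
}

/* Drops one reference. Returns 1 if the pmap was freed, 0 otherwise. */
int toy_pmap_destroy(struct toy_pmap *pm)
{
    if (--pm->refcount > 0)
        return 0;

    /*
     * No scan of the L1 table here: the caller is required to have
     * removed every mapping already, so we only assert that it did.
     */
    assert(pm->resident_count == 0);
    free(pm);
    return 1;
}
```

The point of the design is that teardown cost no longer depends on the size
of the L1 table; a caller that leaves a mapping behind trips the assertion
instead of being silently cleaned up after.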

 > However I'm having some issues with it that I need to look into.
 > 
 > I'm also playing with some other tweaks to the code, but I really need to sit 
 > down again and work on making some clear notes, eg how to tell a page is 
 > wired, modified, referenced etc.
 > 
 > I also implemented a pool for the pmap objects, so that we don't keep 
 > allocating and freeing them.

You might want to consider some sort of home-grown cache of L1 tables.
IIRC, the ARM uses a 16K table, so they can be tricky to allocate as
memory gets fragmented.  The cache would also allow you to keep the L1
tables "constructed", i.e. the kernel L1 PTEs always valid in the top
N slots of the table (the ARM is a single-address-space system, right?)
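
A rough user-space sketch of the cache idea, under stated assumptions: the L1
table is modelled as 4096 32-bit entries (16K), KERN_SLOTS and CACHE_MAX are
made-up numbers, and kernel_l1[] stands in for wherever the kernel's L1
entries really come from. aligned_alloc() stands in for the contiguous,
16K-aligned allocation the kernel would have to do. None of these names exist
in the actual pmap code.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define L1_ENTRIES  4096u                          /* 4096 x 4 bytes = 16K */
#define L1_SIZE     (L1_ENTRIES * sizeof(uint32_t))
#define KERN_SLOTS  256u   /* hypothetical N: top slots holding kernel PTEs */
#define CACHE_MAX   8u     /* hypothetical cap on the number of cached tables */

static uint32_t  kernel_l1[KERN_SLOTS];  /* stand-in for the kernel's L1 entries */
static uint32_t *l1_cache[CACHE_MAX];    /* constructed tables awaiting reuse */
static unsigned  l1_cached;

/* Returns a constructed L1 table, or NULL if a fresh allocation fails. */
uint32_t *l1_get(void)
{
    uint32_t *l1;

    /* Fast path: a cached table already has its kernel slots filled in. */
    if (l1_cached > 0)
        return l1_cache[--l1_cached];

    /* Slow path: allocate a 16K-aligned table and construct it once. */
    l1 = aligned_alloc(L1_SIZE, L1_SIZE);
    if (l1 == NULL)
        return NULL;
    memset(l1, 0, L1_SIZE);
    memcpy(&l1[L1_ENTRIES - KERN_SLOTS], kernel_l1, sizeof(kernel_l1));
    return l1;
}

/* Returns a table to the cache; the kernel slots are left intact. */
void l1_put(uint32_t *l1)
{
    assert(l1 != NULL);

    /*
     * Defensive: wipe only the user portion. If the pmap really had no
     * mappings left at destroy time, this region is already zero.
     */
    memset(l1, 0, (L1_ENTRIES - KERN_SLOTS) * sizeof(uint32_t));

    if (l1_cached < CACHE_MAX)
        l1_cache[l1_cached++] = l1;
    else
        free(l1);               /* cache full: give the memory back */
}
```

A table that goes through l1_put() and comes back from l1_get() skips both
the hard-to-satisfy 16K allocation and the copy of the kernel entries. The
sketch ignores locking and the problem of keeping cached tables in sync when
the kernel's own L1 entries change, both of which a real version would need.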

 > I've also played with getting rid of the static L1 tables, but somehow they 
 > keep ending up getting fragmented after being freed and reused *sigh*  It 
 > suggests some kind of vm leak somewhere.  Sadly trying to allocate 4 pages in 
 > a contiguous block will always be a problem.  I did consider putting them in 
 > a pool, but that won't work as we need the pglist.  One thought I had, but 
 > I'm not sure it'll work is to have the l1pt structs in a pool, and keep the 
 > pglist still in the struct, but I've not had time to examine how the pool 
 > code works to tell if this is feasible.
 > 
 > Note that you can actually have more than 256 processes, I did run lat_ctx 
 > with 300 processes, and it worked.  The issue is getting that 16k of 
 > contiguous memory, (I'm also wondering if there's a problem with the Kernel VM 
 > space of finding 16k in that...)
 > 
 > So many things to think on/consider.
 > 
 > If anyone is interested in looking at the stuff I've done so far, let me know 
 > and I'll mail the current diff.
 > 
 > Cheers,
 > Chris

-- 
        -- Jason R. Thorpe <thorpej@zembu.com>