Subject: Re: cpu_switch (was Re: 1.5 Release documentation ...)
To: Neil A. Carson <neil@causality.com>
From: Richard Earnshaw <rearnsha@arm.com>
List: port-arm32
Date: 11/08/2000 14:10:02
> Richard Earnshaw wrote:
> 
> > IMHO It's still a mess and needs a rewrite from the ground up.  I've done
> > some hacks at home and managed to remove approximately 90% of the cache
> > flush calls from some routines; but we are still flushing the cache far
> > too often and the impact of the changes I've made is not as significant as
> > one might expect from the headline figure.
> 
> Which routines - are you sure they are correct?

Well, the details are at home, so I'll see what I can remember offhand.

IIRC the first thing I looked at was pmap_kenter and friends.  By 
modifying these we could save entering pages into the pv tables that would 
only ever be mapped once.  This in itself saves a small amount of table 
walking.

Having done this, based on the code for the x86 pmap, it would appear that 
there is no need to go to splimp() each time we want to walk the pv tables 
(a simple lock will suffice, and even then, this is only needed for 
multi-processor machines).

The next thing to note is that pmap_copy_page and pmap_zero_page are 
invalidating the cache if the page is mapped into more than one process.  
This is overkill, since it is only the current pmap that is relevant with 
respect to cache flushing -- if the page isn't part of the current 
process, then there is no need to flush the cache at all; if it is in the 
current process, then it is normally only mapped at one address, so in 
that case we only have to clean a single page (we often clean the whole 
cache at present).  The extra cost of walking more of the pv tables is 
more than compensated for by the number of cache cleans that we eliminate.
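
To make that decision concrete, here is a rough sketch in C.  The structure 
and function names are simplified stand-ins invented for illustration; this 
is not the actual arm32 pmap code, and the real pv tables carry more state.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel's structures. */
struct pmap { int id; };

struct pv_entry {
    struct pv_entry *pv_next;  /* next mapping of the same physical page */
    struct pmap     *pv_pmap;  /* address space that holds this mapping */
    unsigned long    pv_va;    /* virtual address of the mapping */
};

enum clean_action { CLEAN_NONE, CLEAN_ONE_PAGE, CLEAN_WHOLE_CACHE };

/*
 * Decide how much of the cache to clean before copying or zeroing a page.
 * Only mappings that belong to the current pmap are considered:
 *   - none in the current pmap   -> no clean needed
 *   - exactly one                -> clean that single page (va in *va_out)
 *   - more than one (unusual)    -> fall back to cleaning the whole cache
 */
enum clean_action
page_clean_action(const struct pv_entry *pv, const struct pmap *curpm,
                  unsigned long *va_out)
{
    int hits = 0;

    assert(curpm != NULL);
    assert(va_out != NULL);

    for (; pv != NULL; pv = pv->pv_next) {
        if (pv->pv_pmap != curpm)
            continue;           /* other processes are irrelevant here */
        if (++hits > 1)
            return CLEAN_WHOLE_CACHE;
        *va_out = pv->pv_va;
    }
    return hits == 1 ? CLEAN_ONE_PAGE : CLEAN_NONE;
}
```

The walk visits every pv entry for the page, which is the extra table 
walking mentioned above; the payoff is that the common cases return 
CLEAN_NONE or CLEAN_ONE_PAGE.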

Next we can add a pmap_page_idle_zero function -- this is similar to 
pmap_zero_page, except that we can assert that the page isn't mapped 
anywhere (since otherwise it wouldn't be on the free page list).  This 
still blasts part of the cache, but this is less significant because we 
always blast a virtually addressed cache (VAC) on a context switch anyway.
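
As a minimal sketch of the idea (the page structure, its field names and the 
4096-byte page size are assumptions made up for this example, not the real 
kernel types), the idle-time variant amounts to asserting that there are no 
mappings and then zeroing without any pv walk:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096   /* assumed page size for the sketch */

/* Hypothetical stand-in for a physical page and its pv list head. */
struct sketch_page {
    const void   *pv_head;                 /* NULL when nothing maps it */
    unsigned char data[SKETCH_PAGE_SIZE];  /* the page contents */
};

/*
 * Zero a page taken from the free list.  Because a free page cannot be
 * mapped anywhere, there is no pv list to walk and no per-mapping cache
 * clean to decide on; we only assert the invariant and clear the page.
 */
void
sketch_page_idle_zero(struct sketch_page *pg)
{
    assert(pg != NULL);
    assert(pg->pv_head == NULL);   /* must not be mapped anywhere */
    memset(pg->data, 0, sizeof pg->data);
}
```

The stores done while zeroing are what displace other cache lines; that part 
is not avoided by this sketch.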

Finally, I've been looking at implementing pmap_copy to duplicate an 
entire pmap (this is potentially cheaper than taking a large number of 
faults when a new process starts up).  However, experiments so far haven't 
shown any performance increase here (though we might win if we were to 
temporarily enable caching of the pmap tables while doing the copy).  I 
think the main reason for the non-startling result is that normally fork() 
is immediately followed by exec(), so we don't get to use many of the 
entries we have just copied.  The jury is still out on this one.

> 
> You are right that process exit is the last one. A long time ago I
> suggested the FreeBSD optimisation of cleaning the whole thing out in
> one go (especially since even on x86 for large processes traversing the
> tables in pmap_remove for huge hunks of process just nicely flushes
> parts of the cache for you!) which would mean you could do one flush per
> exit on ARM. Even on x86 for FreeBSD this made big scalability
> improvements.

Yes, either that, or to make something like a pmap_update call at the end 
of the unmaps -- we could then build a task list of the invalidates and do 
them all with a single cache flush.
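
Something along these lines, as a sketch only.  The names, the fixed-size 
list and the flush counter are all invented for illustration; a real 
implementation would do the actual cache and TLB maintenance where the 
counter is bumped.

```c
#include <assert.h>
#include <stddef.h>

#define PENDING_MAX 64   /* arbitrary limit chosen for the sketch */

/* Hypothetical per-pmap list of invalidates queued by the unmap path. */
struct pending_flush {
    unsigned long va[PENDING_MAX];  /* addresses awaiting invalidation */
    size_t        count;            /* number of valid entries in va[] */
    int           overflow;         /* set when more were queued than fit */
    unsigned long cache_flushes;    /* stands in for the real cache flush */
};

/* Called from each unmap instead of flushing the cache immediately. */
void
pending_add(struct pending_flush *pf, unsigned long va)
{
    assert(pf != NULL);
    if (pf->count < PENDING_MAX)
        pf->va[pf->count++] = va;
    else
        pf->overflow = 1;   /* lost track; the batch flush covers it */
}

/* pmap_update-style hook: one flush for everything queued so far. */
void
pending_update(struct pending_flush *pf)
{
    assert(pf != NULL);
    if (pf->count == 0 && !pf->overflow)
        return;             /* nothing was unmapped, nothing to do */
    pf->cache_flushes++;    /* real code: clean/invalidate cache and TLB */
    pf->count = 0;
    pf->overflow = 0;
}
```

With this shape, tearing down N mappings costs N calls to pending_add() and 
a single flush in pending_update(), rather than a flush per unmap.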

> 
> Can't remember what happened to that.

:-(

> 
> Also another thing was some changes I hacked to make the page tables
> temporarily cacheable to make the remove function faster. Made some
> difference but with the above change this would probably not be necessary.

As I mentioned above, this might be a win with pmap_copy().

> 
> It's all ugly stuff. I sure think the pmap can be rewritten to tidy it
> up a lot, but I'm not sure you'll win much on performance. I think my
> changes a few years ago had got 80% of the performance win with 20% of
> the pain.

Well, the new x86 pmap is much cleaner, easier to understand and 
significantly faster, even on a 75MHz Pentium -- much of this is 
architectural, but there is still much that can be done to the ARM pmap to 
improve things.  In particular, we need to implement some of the pmap 
stealing functions to more gracefully handle running out of L1 page tables 
and the like.

R.