Re: 16KB page ibm4xx performance
On Wed, Aug 18, 2010 at 04:58:08PM +0000, Eduardo Horvath wrote:
> On Wed, 18 Aug 2010, Masao Uebayashi wrote:
> > On Mon, Aug 16, 2010 at 05:23:56PM +0900, Masao Uebayashi wrote:
> > > I'm testing XIP on OpenBlockS266 (405GPr). It works, but it seems very
> > > slow. If I disable neighbor fault for MADV_NORMAL, /bin/ksh as XIP
> > > starts up twice as fast, and time -l shows half as many page reclaims.
> > >
> > > I guess this platform needs serious TLB / cache tuning...
> > There is a mixture of problems:
> > - Our PowerPC ELF has RWX .data/.got/.plt. If programs' .data sections
> > are not aligned to 16KB, those are mapped as "overlay"; pages are always
> > copied.
> > - Mapping executable pages by pmap_enter() is very expensive because of the
> > __syncicache() operation.
> > - Neighbor faults tend to cause TLB shortage. This is especially bad if
> > exec mappings become victims.
> > I can get useful speed on 405GPr now by doing the following:
> > - Use static binaries.
> > - Disable neighbor faults.
> >
> > I'd recommend all 405GPr users to use only static userland, so you
> > get 2x speed...
This was just a work-around for users, not a real solution.
> Actually, let me make a couple of suggestions.
> 1) Instead of static linking, try changing the ELF page size to something
> larger than 64KB; that way you don't have issues with the different sections
> crossing page boundaries.
That is one option. It is probably done with some linker script magic
(which I couldn't figure out in a few minutes). Another is to support
a read-only PLT.
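To make the alignment issue concrete, here is a minimal sketch in C. It is
not NetBSD code; the function names and the example addresses are made up
for illustration. It checks whether the last byte of a text segment and the
first byte of the data segment fall into the same MMU page, which is the
situation where a page cannot be shared read-only.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if addr is a multiple of page_size.
 * page_size is assumed to be a power of two. */
bool
is_page_aligned(uint32_t addr, uint32_t page_size)
{
    return (addr & (page_size - 1)) == 0;
}

/* Hypothetical helper: true if the text segment (ending at text_end,
 * exclusive) and the data segment (starting at data_start) touch the
 * same MMU page of the given size. */
bool
segments_share_page(uint32_t text_end, uint32_t data_start,
    uint32_t page_size)
{
    if (text_end == 0)
        return false;   /* empty text segment, nothing to share */

    uint32_t last_text_page = (text_end - 1) / page_size;
    uint32_t first_data_page = data_start / page_size;

    return last_text_page == first_data_page;
}
```

With a made-up layout where text ends at 0x1001a400 and data starts at
0x1001b000, the two segments are separate with 4KB pages but share a page
with 16KB pages. Moving the data start to a 64KB boundary separates them
again.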
> 2) The TLB supports multiple page sizes. Add support for multiple page
> sizes to the OS. (Probably a largish project.)
ibm4xx already uses 16MB TLB entries for reserved mappings and for the
demand-loaded PA == VA mapping of the RAM area (pmap_tlbmiss()). I think
this is good enough.
> 3) The TLB page replacement algorithm is all done in software. Tweak it.
Personally, I'd like such a part to be not very smart but to behave in a
more predictable way, like simple LRU.
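A rough sketch of what "simple LRU" could look like, in plain C. This is
not the existing ibm4xx code. The structure, the function names, and the
NTLB value are assumptions for illustration, and the tick counter is
allowed to wrap without special handling.

```c
#include <stdint.h>

#define NTLB 64     /* assumed number of TLB entries, for illustration */

struct tlb_slot {
    uint32_t last_used; /* tick of most recent use; 0 = never used */
    int pinned;         /* nonzero: reserved entry, never evicted */
};

struct tlb_lru {
    struct tlb_slot slot[NTLB];
    uint32_t tick;      /* monotonically increasing use counter */
};

/* Mark the first nreserved entries as pinned, the rest as unused. */
void
tlb_lru_init(struct tlb_lru *t, int nreserved)
{
    for (int i = 0; i < NTLB; i++) {
        t->slot[i].last_used = 0;
        t->slot[i].pinned = (i < nreserved);
    }
    t->tick = 0;
}

/* Record that entry idx was just used (e.g. just reloaded). */
void
tlb_lru_touch(struct tlb_lru *t, int idx)
{
    t->slot[idx].last_used = ++t->tick;
}

/* Return the least recently used non-pinned entry, or -1 if every
 * entry is pinned. Ties go to the lowest index. */
int
tlb_lru_victim(const struct tlb_lru *t)
{
    int victim = -1;

    for (int i = 0; i < NTLB; i++) {
        if (t->slot[i].pinned)
            continue;
        if (victim < 0 ||
            t->slot[i].last_used < t->slot[victim].last_used)
            victim = i;
    }
    return victim;
}
```

The point is predictability: given the same sequence of touches, the same
entry is always chosen. A real miss handler would still have to decide when
an entry counts as "used", since that is only visible to software at reload
time.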
> 4) Optimize pmap so that expensive operations like __syncicache() are only
> done in pmap_update() and only if needed.
Yeah, something like this should be good.
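As a sketch of the idea (a user-space simulation, not the real pmap; every
name below is hypothetical and the real __syncicache() is replaced by a
counter), the enter path could only queue executable pages and leave the
expensive work to the update path. It assumes the caller always invokes the
update step before the new mapping is executed from.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PENDING 32  /* arbitrary queue size for the sketch */

struct sketch_pmap {
    uint32_t pending[MAX_PENDING]; /* pages awaiting an icache sync */
    size_t npending;
    unsigned long nsyncs;          /* number of simulated syncs done */
};

/* Stand-in for the expensive __syncicache() call: only counts. */
static void
sketch_sync_page(struct sketch_pmap *pm, uint32_t pa)
{
    (void)pa;
    pm->nsyncs++;
}

/* Enter a mapping. Executable pages are queued, not synced. */
void
sketch_pmap_enter(struct sketch_pmap *pm, uint32_t pa, bool executable)
{
    if (!executable)
        return;                     /* data pages need no icache sync */

    for (size_t i = 0; i < pm->npending; i++)
        if (pm->pending[i] == pa)
            return;                 /* already queued */

    if (pm->npending == MAX_PENDING) {
        sketch_sync_page(pm, pa);   /* queue full: sync immediately */
        return;
    }
    pm->pending[pm->npending++] = pa;
}

/* Perform all deferred syncs, once per queued page. */
void
sketch_pmap_update(struct sketch_pmap *pm)
{
    for (size_t i = 0; i < pm->npending; i++)
        sketch_sync_page(pm, pm->pending[i]);
    pm->npending = 0;
}
```

Entering the same executable page several times between two updates then
costs one sync instead of one per enter, and non-executable pages cost
none.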
> I never really spent much time optimizing the 4xx pmap. I was much more
> concerned about the copyin/copyout code that still is a lot slower than I
> would like.
The "big" ones? I'm not surprised that vmaprange() -> uvm_km_alloc()
causes TLB shortage and makes things much slower. Why can't we use PA == VA
mapping and copy page by page?
In general, ibm4xx uses uvm_km_alloc() too much, e.g. for page tables.
If we used more PA == VA addresses, the TLB would be used much better.
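What I mean by "copy page by page" is roughly the following. This is a
user-space simulation, not kernel code: sk_ram stands in for RAM reachable
through a PA == VA mapping, sk_pt is a made-up one-level user page table,
and all names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SK_PAGE_SIZE 16384u /* 16KB pages, as in the subject */
#define SK_NPAGES    8      /* tiny simulated machine */

/* Simulated RAM; the index plays the role of a physical address. */
static uint8_t sk_ram[SK_NPAGES * SK_PAGE_SIZE];

/* Hypothetical user page table: virtual page number -> physical
 * page number, or -1 if the page is unmapped. */
static int sk_pt[SK_NPAGES];

/* Copy len bytes from kernel buffer ksrc to user address uva, one
 * page at a time, writing through the simulated PA == VA window.
 * Returns 0 on success, -1 on an unmapped page (a real kernel
 * would return EFAULT); earlier pages may already be written. */
int
sk_copyout(const void *ksrc, uint32_t uva, size_t len)
{
    const uint8_t *src = ksrc;

    while (len > 0) {
        uint32_t vpn = uva / SK_PAGE_SIZE;
        uint32_t off = uva % SK_PAGE_SIZE;
        size_t chunk = SK_PAGE_SIZE - off; /* bytes left in this page */

        if (chunk > len)
            chunk = len;
        if (vpn >= SK_NPAGES || sk_pt[vpn] < 0)
            return -1;

        memcpy(&sk_ram[(size_t)sk_pt[vpn] * SK_PAGE_SIZE + off],
            src, chunk);

        src += chunk;
        uva += (uint32_t)chunk;
        len -= chunk;
    }
    return 0;
}
```

No temporary kernel virtual range is allocated here; each chunk goes
through an address that is already covered by the large PA == VA mapping.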
Masao Uebayashi / Tombi Inc. / Tel: +81-90-9141-4635