Re: 16KB page ibm4xx performance

To: Eduardo Horvath <eeh%NetBSD.org@localhost>
Subject: Re: 16KB page ibm4xx performance
From: Masao Uebayashi <uebayasi%tombi.co.jp@localhost>
Date: Thu, 19 Aug 2010 09:25:15 +0900

On Wed, Aug 18, 2010 at 04:58:08PM +0000, Eduardo Horvath wrote:
> On Wed, 18 Aug 2010, Masao Uebayashi wrote:
> 
> > On Mon, Aug 16, 2010 at 05:23:56PM +0900, Masao Uebayashi wrote:
> > > I'm testing XIP on OpenBlockS266 (405GPr).  It works, but it seems very
> > > slow.  If I disable neighbor fault for MADV_NORMAL, /bin/ksh as XIP
> > > starts up twice as fast, and time -l shows half as many page reclaims.
> > > 
> > > I guess this platform needs serious TLB / cache tuning...
> > 
> > There is a mixture of problems:
> > 
> > - Our PowerPC ELF has RWX .data/.got/.plt.  If programs' .data sections
> >   are not aligned to 16KB, those are mapped as "overlay"; pages are always
> >   copied.
> > 
> > - Mapping executable pages by pmap_enter() is very expensive because of the
> >   __syncicache() operation.
> > 
> > - Neighbor fault tends to cause TLB shortage.  This is bad especially if
> >   exec mappings become victims.
> > 
> > I can get useful speed on 405GPr now by doing the following:
> > 
> > - Use static binaries.
> > 
> > - Disable neighbor faults.
> > 
> > *
> > 
> > I'd recommend that all 405GPr users use only a static userland, so you
> > get 2x speed...

This was just a work-around for users, not a real solution.

> Actually, let me make a couple of suggestions.
> 
> 1) Instead of static linking, try changing the ELF page size to something
> larger than 64KB; that way you don't have issues with the different sections
> crossing page boundaries.

That is one option.  It can probably be done with some linker script
magic (which I couldn't figure out in a few minutes).  Another is to
support the read-only PLT [1].
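
To make the alignment issue concrete, here is a minimal sketch in C of
the check involved: whether a segment's start address falls on a page
boundary for a given page size, and how much of its first page belongs
to the preceding section.  The toy_* names and the addresses in the
usage note are made up for illustration; this is not how the kernel or
the linker actually computes it.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if x is a nonzero power of two (a valid page size). */
static bool toy_is_pow2(uint32_t x)
{
    return x != 0 && (x & (x - 1)) == 0;
}

/* Hypothetical helper: does a segment starting at vaddr begin exactly
 * on a page boundary for the given page size? */
bool toy_segment_page_aligned(uint32_t vaddr, uint32_t page_size)
{
    if (!toy_is_pow2(page_size))
        return false;
    return (vaddr & (page_size - 1)) == 0;
}

/* Hypothetical helper: number of bytes at the start of the segment's
 * first page that belong to whatever precedes the segment.  A nonzero
 * value means the page is shared between two sections. */
uint32_t toy_shared_prefix(uint32_t vaddr, uint32_t page_size)
{
    if (!toy_is_pow2(page_size))
        return 0;
    return vaddr & (page_size - 1);
}
```

For example, a .data that starts at the (made-up) address 0x01811000
is aligned for 4KB pages but not for 16KB pages, where the first 4KB
of its page would belong to the preceding section.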

> 2) The TLB supports multiple page sizes.  Add support for multiple page 
> sizes to the OS.  (Probably a largish project.)

ibm4xx already uses 16MB TLB entries for reserved maps and for
demand-loaded PA == VA mappings of the RAM area (pmap_tlbmiss()).  I
think this is good enough for now.
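
As a rough illustration of the PA == VA idea (not the actual
pmap_tlbmiss() code), a miss on an address inside RAM can be resolved
by rounding the address down to a 16MB boundary and mapping that
region onto itself.  The struct, the function name and the ram_size
parameter below are assumptions made for this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_LARGE_PAGE_SIZE (16u * 1024u * 1024u)   /* 16MB */

/* Hypothetical, simplified TLB entry: just the fields needed here. */
struct toy_tlb_entry {
    uint32_t va;    /* virtual base of the mapping */
    uint32_t pa;    /* physical base of the mapping */
    uint32_t size;  /* size of the mapping in bytes */
};

/* On a miss inside RAM, fill *out with the 16MB identity mapping that
 * covers fault_addr.  Returns false if the address is outside RAM, in
 * which case the caller would fall back to the normal page tables. */
bool toy_direct_map_miss(uint32_t fault_addr, uint32_t ram_size,
                         struct toy_tlb_entry *out)
{
    if (fault_addr >= ram_size)
        return false;

    uint32_t base = fault_addr & ~(TOY_LARGE_PAGE_SIZE - 1u);
    out->va = base;     /* PA == VA: same base on both sides */
    out->pa = base;
    out->size = TOY_LARGE_PAGE_SIZE;
    return true;
}
```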

> 3) The TLB page replacement algorithm is all done in software.  Tweak it.

Personally I prefer such a part not to be very smart, but to behave in
a more predictable way, like simple LRU.
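
By "simple LRU" I mean something along these lines.  This is only a
sketch of the policy in C; the toy_* names, the entry count and the
timestamp scheme are assumptions for illustration, not the existing
ibm4xx replacement code.

```c
#include <stdint.h>

#define TOY_TLB_ENTRIES 8   /* arbitrary size for the sketch */

struct toy_tlb {
    uint32_t vpn[TOY_TLB_ENTRIES];        /* virtual page number per slot */
    uint64_t last_used[TOY_TLB_ENTRIES];  /* 0 means the slot is empty */
    uint64_t clock;                       /* bumped on every use */
};

/* Return the slot holding vpn and mark it most recently used,
 * or -1 on a miss. */
int toy_tlb_lookup(struct toy_tlb *t, uint32_t vpn)
{
    for (int i = 0; i < TOY_TLB_ENTRIES; i++) {
        if (t->last_used[i] != 0 && t->vpn[i] == vpn) {
            t->last_used[i] = ++t->clock;
            return i;
        }
    }
    return -1;
}

/* Insert vpn, evicting the least recently used slot.  Empty slots have
 * timestamp 0, so they are filled before any valid entry is evicted. */
int toy_tlb_insert(struct toy_tlb *t, uint32_t vpn)
{
    int victim = 0;
    for (int i = 1; i < TOY_TLB_ENTRIES; i++) {
        if (t->last_used[i] < t->last_used[victim])
            victim = i;
    }
    t->vpn[victim] = vpn;
    t->last_used[victim] = ++t->clock;
    return victim;
}
```

The point is that the victim is always the entry touched longest ago,
so the behaviour is easy to reason about when debugging TLB shortage.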

> 4) Optimize pmap so that expensive operations like __syncicache() are only 
> done in pmap_update() and only if needed.

Yeah, something like this should be good.
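
A toy model of that idea in C, to show the shape of it: pmap_enter()
only records that an executable page needs an instruction cache sync,
and pmap_update() does the work once per pending page.  Everything
with a toy_ prefix is hypothetical, and the sync itself is just a
counter here.  A real pmap would also have to guarantee the sync
happens before the page can be executed.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_NPAGES 64   /* arbitrary number of pages for the sketch */

struct toy_pmap {
    bool need_isync[TOY_NPAGES];  /* exec page entered, icache not yet synced */
    bool any_pending;             /* lets toy_pmap_update() return early */
    unsigned sync_calls;          /* counts simulated __syncicache() calls */
};

/* Stand-in for the expensive cache operation: only counts calls. */
static void toy_syncicache(struct toy_pmap *pm, size_t page)
{
    (void)page;
    pm->sync_calls++;
}

/* Record the mapping; defer the icache sync instead of doing it here. */
void toy_pmap_enter(struct toy_pmap *pm, size_t page, bool executable)
{
    if (page >= TOY_NPAGES)
        return;
    if (executable) {
        pm->need_isync[page] = true;
        pm->any_pending = true;
    }
}

/* Sync each pending page once; do nothing if nothing is pending. */
void toy_pmap_update(struct toy_pmap *pm)
{
    if (!pm->any_pending)
        return;
    for (size_t i = 0; i < TOY_NPAGES; i++) {
        if (pm->need_isync[i]) {
            toy_syncicache(pm, i);
            pm->need_isync[i] = false;
        }
    }
    pm->any_pending = false;
}
```

With this shape, entering the same executable page several times
before one update costs a single sync, and non-executable mappings
cost none.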

> I never really spent much time optimizing the 4xx pmap.  I was much more 
> concerned about the copyin/copyout code that still is a lot slower than I 
> would like.
> 
> Eduardo

The "big" ones?  I'm not surprised that vmaprange() -> uvm_km_alloc()
causes TLB shortage and makes things much slower.  Why can't we use
PA == VA mappings and copy page by page?

In general, ibm4xx uses too many uvm_km_alloc()s, e.g. for page tables.
If we used more PA == VA addresses, the TLB would be used much more
effectively.

Masao

[1] http://www.netbsd.org/contrib/soc-projects.html#secureplt

-- 
Masao Uebayashi / Tombi Inc. / Tel: +81-90-9141-4635
