aarch64 performance tweaks

To: port-arm%netbsd.org@localhost
Subject: aarch64 performance tweaks
From: Andrew Doran <ad%netbsd.org@localhost>
Date: Tue, 16 Jun 2020 23:31:06 +0000

Hi,

I made some more changes to reduce system time on aarch64 during compile
jobs.  This is as far as I want to go here, so I'm finished after this.
Review would be appreciated.

Time before & after for an MKCTF=no kernel build on an RK3399:

	before:	643.40 real 3140.42 user 532.59 sys
	after:	632.24 real 3159.67 user 455.31 sys

Thanks,
Andrew


http://www.netbsd.org/~ad/2020/aarch64/atomic.diff

- Remove memory barriers from the atomic ops.  I don't understand why those
  are there.  Is it some architectural thing, or for a CPU bug, or just
  over-caution maybe?  They're not needed for correctness.

- Have unlikely conditional branches go forwards to help the static branch
  predictor.
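
To illustrate the two points above, here is a minimal sketch in portable C11
rather than the aarch64 assembly the diff actually touches.  The sketch_*
names are hypothetical and are not the functions in atomic.diff.  Relaxed
ordering makes the read-modify-write atomic without implying any barrier,
and the unlikely() hint (a GCC/Clang builtin) asks the compiler to treat the
retry as the rare path, which compilers typically lay out away from the
fall-through code.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hint that a condition is rarely true (GCC/Clang builtin). */
#define unlikely(x)	__builtin_expect(!!(x), 0)

/*
 * Hypothetical atomic add returning the new value.  The update is
 * atomic, but memory_order_relaxed orders nothing around it.
 */
static inline uint32_t
sketch_atomic_add_32_nv(_Atomic uint32_t *p, uint32_t v)
{
	return atomic_fetch_add_explicit(p, v, memory_order_relaxed) + v;
}

/*
 * Hypothetical compare-and-swap returning the value that was seen.
 * The weak form may fail spuriously, so retry while the value still
 * matches; that retry is the unlikely path.
 */
static inline uint32_t
sketch_atomic_cas_32(_Atomic uint32_t *p, uint32_t expected, uint32_t newval)
{
	uint32_t old = expected;

	while (unlikely(!atomic_compare_exchange_weak_explicit(p, &old,
	    newval, memory_order_relaxed, memory_order_relaxed))) {
		if (old != expected)
			break;		/* genuine mismatch, give up */
		/* spurious failure: old still equals expected, retry */
	}
	return old;
}
```

A caller that does need ordering would add an explicit fence around the
call, which is the division of labour the first bullet argues for.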


http://www.netbsd.org/~ad/2020/aarch64/cpu.diff

- Use tpidr_el1 to hold curlwp and not curcpu, because curlwp is accessed
  much more often by MI code.  It also makes curlwp preemption safe and
  allows aarch64_curlwp() to be a const function (curcpu must be volatile).

- Make ASTs operate per-LWP rather than per-CPU, otherwise sometimes LWPs
  can see spurious ASTs (which doesn't cause a problem, it just means some
  time may be wasted).

- Use plain stores to set/clear ASTs.  Make sure ASTs are always set on the
  same CPU as the target LWP, and delivered via IPI if posted from a remote
  CPU so that they are resolved quickly.

- Add some cache line padding to struct cpu_info, to match x86.

- Add a memory barrier in a couple of places where ci_curlwp is set.  This
  is needed whenever an LWP that is resuming on the CPU could hold an
  adaptive mutex.  The barrier needs to drain the CPU's store buffer, so
  that the update to ci_curlwp becomes globally visible before the LWP can
  resume and call mutex_exit().  By my reading of the ARM docs it looks like
  the instruction I used will do the right thing, but I'm not 100% sure.
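
The ordering requirement in the last bullet can be sketched in portable C11.
This is only an illustration under assumptions: the struct layouts and the
sketch_* functions are hypothetical stand-ins, and the mail does not say
which instruction the diff uses.  On aarch64, compilers typically emit a
"dmb ish" for a sequentially consistent fence; whether that is equivalent to
what the diff does is not something this sketch claims.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Minimal stand-ins for the kernel structures (hypothetical layout). */
struct lwp {
	int	l_id;
};

struct cpu_info {
	struct lwp *_Atomic ci_curlwp;
};

/*
 * Publish the LWP that is about to run on this CPU.  The full fence
 * orders the store to ci_curlwp before any later load or store by
 * this thread, including those done by mutex_exit() once the LWP
 * resumes.
 */
static inline void
sketch_set_curlwp(struct cpu_info *ci, struct lwp *l)
{
	atomic_store_explicit(&ci->ci_curlwp, l, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);
}

/*
 * What a waiter spinning on an adaptive mutex asks about a remote
 * CPU: is the lock owner currently running there?
 */
static inline int
sketch_owner_running(struct cpu_info *ci, const struct lwp *owner)
{
	return atomic_load_explicit(&ci->ci_curlwp,
	    memory_order_relaxed) == owner;
}
```

If the store could still be sitting in the store buffer when the resumed LWP
releases the mutex, a waiter calling something like sketch_owner_running()
could act on a stale ci_curlwp, which is the case the barrier is meant to
rule out.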


http://www.netbsd.org/~ad/2020/aarch64/mutex.diff

- Assembly language stubs for mutex_enter() and mutex_exit().
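
For readers who have not seen such stubs, the usual shape of the fast path
can be sketched in C11.  This is not the assembly from mutex.diff: the
struct, the sketch_* names and the counting slow paths are hypothetical, and
the owner word is assumed to be 0 when the mutex is free and to hold the
owning LWP's address otherwise.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical adaptive mutex: owner word is 0 when unowned. */
struct sketch_mutex {
	_Atomic uintptr_t mtx_owner;
};

/* Stand-ins for the C slow paths; here they only count calls. */
static int sketch_slow_enter_calls;
static int sketch_slow_exit_calls;

static void
sketch_mutex_slow_enter(struct sketch_mutex *m, uintptr_t self)
{
	(void)m; (void)self;
	sketch_slow_enter_calls++;	/* real code would spin or sleep */
}

static void
sketch_mutex_slow_exit(struct sketch_mutex *m, uintptr_t self)
{
	(void)m; (void)self;
	sketch_slow_exit_calls++;	/* real code would wake waiters */
}

/* Fast path: one CAS from "unowned" to the calling LWP. */
static inline void
sketch_mutex_enter(struct sketch_mutex *m, uintptr_t self)
{
	uintptr_t expected = 0;

	if (!atomic_compare_exchange_strong_explicit(&m->mtx_owner,
	    &expected, self, memory_order_acquire, memory_order_relaxed))
		sketch_mutex_slow_enter(m, self);
}

/* Fast path: one CAS from the calling LWP back to "unowned". */
static inline void
sketch_mutex_exit(struct sketch_mutex *m, uintptr_t self)
{
	uintptr_t expected = self;

	assert(self != 0);
	if (!atomic_compare_exchange_strong_explicit(&m->mtx_owner,
	    &expected, 0, memory_order_release, memory_order_relaxed))
		sketch_mutex_slow_exit(m, self);
}
```

The point of writing the fast path as a stub is that the uncontended case
costs a single compare-and-swap and no function call into the slow path.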


http://www.netbsd.org/~ad/2020/aarch64/pmap.diff

- Implement pmap_growkernel(), and update kernel pmap's stats with atomics.

- Then, pmap_kenter_pa() and pmap_kremove() no longer need to allocate
  memory or take pm_lock, because they only modify L3 PTEs.

- Then, pm_lock and pp_lock can be adaptive mutexes at IPL_NONE, which are
  cheaper than spin mutexes.

- Take the pmap's lock in pmap_extract() if the pmap is not the kernel's,
  otherwise pmap_extract() might see inconsistent state.
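
The dependency between the first two bullets can be shown with a toy model
in C.  Everything here is a hypothetical simplification, not code from
pmap.diff: a flat array of leaf tables stands in for the real page table
tree, a 4KB page size is assumed (so one leaf table of 512 entries spans
2MB), and calloc() stands in for the kernel's page allocator.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

#define SK_PAGE_SHIFT	12
#define SK_L3_ENTRIES	512
#define SK_L3_SPAN	((uintptr_t)SK_L3_ENTRIES << SK_PAGE_SHIFT)
#define SK_MAX_TABLES	8
#define SK_PTE_VALID	1ULL

struct sketch_kpmap {
	uint64_t	*l3[SK_MAX_TABLES];	/* NULL until grown */
	size_t		ngrown;
	_Atomic long	resident_count;		/* lock-free statistic */
};

/* Preallocate leaf tables to cover [0, maxva); return the VA covered. */
static uintptr_t
sketch_growkernel(struct sketch_kpmap *pm, uintptr_t maxva)
{
	size_t need = (maxva + SK_L3_SPAN - 1) / SK_L3_SPAN;

	if (need > SK_MAX_TABLES)
		need = SK_MAX_TABLES;
	while (pm->ngrown < need) {
		uint64_t *t = calloc(SK_L3_ENTRIES, sizeof(*t));
		if (t == NULL)
			break;
		pm->l3[pm->ngrown++] = t;
	}
	return (uintptr_t)pm->ngrown * SK_L3_SPAN;
}

/* Enter a mapping: only a leaf PTE is written, nothing is allocated. */
static void
sketch_kenter_pa(struct sketch_kpmap *pm, uintptr_t va, uint64_t pa)
{
	size_t ti = va / SK_L3_SPAN;
	size_t pi = (va >> SK_PAGE_SHIFT) % SK_L3_ENTRIES;

	assert(ti < pm->ngrown);	/* growkernel must have run first */
	pm->l3[ti][pi] = pa | SK_PTE_VALID;
	atomic_fetch_add_explicit(&pm->resident_count, 1,
	    memory_order_relaxed);
}

/* Remove a mapping: again only the leaf PTE and the counter change. */
static void
sketch_kremove(struct sketch_kpmap *pm, uintptr_t va)
{
	size_t ti = va / SK_L3_SPAN;
	size_t pi = (va >> SK_PAGE_SHIFT) % SK_L3_ENTRIES;

	assert(ti < pm->ngrown);
	if (pm->l3[ti][pi] & SK_PTE_VALID) {
		pm->l3[ti][pi] = 0;
		atomic_fetch_sub_explicit(&pm->resident_count, 1,
		    memory_order_relaxed);
	}
}
```

Because the enter and remove paths neither allocate nor walk shared
intermediate state, nothing in them needs a lock in this model, which is
what lets the remaining locks be taken only from contexts that may sleep.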
