aarch64 performance tweaks
Hi,
I made some more changes to reduce system time on aarch64 during compile
jobs. This is as far as I want to go here; I'm finished. Review would be
appreciated.
Time before & after for an MKCTF=no kernel build on an RK3399:
before: 643.40 real 3140.42 user 532.59 sys
after:  632.24 real 3159.67 user 455.31 sys
Thanks,
Andrew
http://www.netbsd.org/~ad/2020/aarch64/atomic.diff
- Remove memory barriers from the atomic ops. I don't understand why those
are there. Is it some architectural thing, or for a CPU bug, or just
over-caution maybe? They're not needed for correctness.
- Have unlikely conditional branches go forwards to help the static branch
predictor; a sketch of the resulting loop shape follows these notes.
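
To make that concrete, here is an illustrative sketch only, not code from
the diff: a barrier-free atomic add built on LDXR/STXR, with the unlikely
store-exclusive failure branching forwards (static predictors typically
treat forward conditional branches as not-taken). The function name is
made up.

/*
 * Sketch: relaxed LL/SC atomic add, no "dmb" anywhere.  The rare
 * STXR failure branches *forwards* to 2f; the common success path
 * falls straight through.
 */
static inline unsigned long
sketch_atomic_add(volatile unsigned long *p, unsigned long v)
{
	unsigned long old, new;
	unsigned int fail;

	__asm volatile(
	"1:	ldxr	%0, [%3]	\n"	/* no acquire semantics */
	"	add	%1, %0, %4	\n"
	"	stxr	%w2, %1, [%3]	\n"	/* no release semantics */
	"	cbnz	%w2, 2f		\n"	/* unlikely: forward branch */
	"	b	3f		\n"	/* success: skip retry stub */
	"2:	b	1b		\n"	/* retry the LL/SC sequence */
	"3:				\n"
	: "=&r"(old), "=&r"(new), "=&r"(fail)
	: "r"(p), "r"(v)
	: "memory");
	return new;
}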
http://www.netbsd.org/~ad/2020/aarch64/cpu.diff
- Use tpidr_el1 to hold curlwp and not curcpu, because curlwp is accessed
much more often by MI code. It also makes curlwp preemption-safe and
allows aarch64_curlwp() to be a const function (curcpu must be volatile).
See the sketch after these notes.
- Make ASTs operate per-LWP rather than per-CPU; otherwise LWPs can
sometimes see spurious ASTs (not a correctness problem, just a little
wasted time).
- Use plain stores to set/clear ASTs. Make sure ASTs are always set on the
same CPU as the target LWP, and delivered via IPI if posted from a remote
CPU so that they are resolved quickly.
- Add some cache line padding to struct cpu_info, to match x86.
- Add a memory barrier in a couple of places where ci_curlwp is set. This
is needed whenever an LWP that is resuming on the CPU could hold an
adaptive mutex. The barrier needs to drain the CPU's store buffer, so
that the update to ci_curlwp becomes globally visible before the LWP can
resume and call mutex_exit(). By my reading of the ARM docs it looks like
the instruction I used will do the right thing, but I'm not 100% sure.
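
Roughly what the first and last points look like, as a hedged sketch
rather than the diff itself. The names are mine, and as said above I'm
not 100% certain which barrier instruction is the right one; "dmb ishst"
below at least enforces the store-store ordering described.

struct lwp;

/*
 * curlwp comes straight out of tpidr_el1.  An LWP's identity never
 * changes from its own point of view, so the accessor can be "const"
 * and the compiler may CSE it freely (curcpu can't get this treatment:
 * preemption can move an LWP, so curcpu must be re-read every time).
 */
__attribute__((__const__)) static inline struct lwp *
sketch_curlwp(void)
{
	struct lwp *l;

	__asm("mrs %0, tpidr_el1" : "=r"(l));
	return l;
}

/*
 * When a resuming LWP is installed, publish ci_curlwp before any store
 * the LWP might make after it resumes (e.g. mutex_exit() releasing an
 * adaptive mutex).  "dmb ishst" orders the earlier store against later
 * ones; whether the diff uses exactly this instruction I can't say.
 */
#define	sketch_set_curlwp(ci, l)					\
do {									\
	(ci)->ci_curlwp = (l);						\
	__asm volatile("dmb ishst" ::: "memory");			\
} while (0)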
http://www.netbsd.org/~ad/2020/aarch64/mutex.diff
- Assembly language stubs for mutex_enter() and mutex_exit().
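
For readers who don't want to chase the diff: the fast paths such stubs
implement boil down to a compare-and-swap on the owner word. Here is a
sketch in C with made-up names (the diff itself is assembly, and the real
kmutex has waiter/flag bits that force the slow path):

#include <stdatomic.h>
#include <stdint.h>

struct sketch_mutex {
	_Atomic uintptr_t owner;	/* 0 = unowned, else owning LWP */
};

extern uintptr_t sketch_curlwp_addr(void);		/* hypothetical */
extern void sketch_enter_slow(struct sketch_mutex *);	/* hypothetical */
extern void sketch_exit_slow(struct sketch_mutex *);	/* hypothetical */

static inline void
sketch_mutex_enter(struct sketch_mutex *mtx)
{
	uintptr_t expect = 0;

	/* CAS curlwp into an unowned mutex; acquire on success. */
	if (!atomic_compare_exchange_strong_explicit(&mtx->owner, &expect,
	    sketch_curlwp_addr(), memory_order_acquire, memory_order_relaxed))
		sketch_enter_slow(mtx);		/* contended / already owned */
}

static inline void
sketch_mutex_exit(struct sketch_mutex *mtx)
{
	uintptr_t expect = sketch_curlwp_addr();

	/* CAS ourselves back out; release on success. */
	if (!atomic_compare_exchange_strong_explicit(&mtx->owner, &expect,
	    0, memory_order_release, memory_order_relaxed))
		sketch_exit_slow(mtx);		/* waiters need waking */
}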
http://www.netbsd.org/~ad/2020/aarch64/pmap.diff
- Implement pmap_growkernel(), and update the kernel pmap's stats with
atomics.
- Then, pmap_kenter_pa() and pmap_kremove() no longer need to allocate
memory or take pm_lock, because they only modify L3 PTEs.
- Then, pm_lock and pp_lock can be adaptive mutexes at IPL_NONE which are
cheaper than spin mutexes.
- Take the pmap's lock in pmap_extract() if it's not the kernel's pmap;
otherwise pmap_extract() might see inconsistent state.
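
To illustrate that last point, the shape of pmap_extract() after the
change is roughly the following. The PTE-walking helper is hypothetical;
mutex_enter()/mutex_exit() and pmap_kernel() are the normal interfaces,
and pm_lock is the adaptive mutex mentioned above.

#include <sys/param.h>
#include <sys/mutex.h>
#include <uvm/uvm_extern.h>

/* Hypothetical helper: walk L0..L3 and fill *pap on success. */
bool	sketch_walk_ptes(struct pmap *, vaddr_t, paddr_t *);

bool
sketch_pmap_extract(struct pmap *pm, vaddr_t va, paddr_t *pap)
{
	bool found;

	/*
	 * The kernel pmap can be walked locklessly: its L3 PTEs are
	 * only changed with plain word-sized stores, and its page
	 * tables only ever grow (pmap_growkernel()).  User pmaps need
	 * pm_lock, now a cheap adaptive mutex at IPL_NONE.
	 */
	if (pm != pmap_kernel())
		mutex_enter(&pm->pm_lock);
	found = sketch_walk_ptes(pm, va, pap);
	if (pm != pmap_kernel())
		mutex_exit(&pm->pm_lock);
	return found;
}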