Subject: Re: Kernel copyin/out optimizations for ARM...
To: None <email@example.com>
From: David Laight <firstname.lastname@example.org>
Date: 03/13/2002 13:10:05
On Tue, Mar 12, 2002 at 10:38:39AM +0000, Richard Earnshaw wrote:
> email@example.com said:
> > While my main interest is the XScale, it seems that some form of
> > improvement may be had for all the arm processors. I was looking at
> > the copyin/out functions, and noticed that after all the checks there
> > is only the check for 'is it bigger than 4 bytes, and is it 32 bit
> > aligned', then copy 32 bit words...
> > It seems that this could be optimized better to use the multiple load
> > features of the cpu to improve copies. The Libc memcpy seems to do
> > this.
> > Is there some reason why this was not done in the kernel?
> > In the case of the XScale, it is capable of doing a 64 bit transfer if
> > things are 'lined up right', and two registers used, and caching is
> > on, etc. etc.
> I've always wondered why that code wasn't written using ldrt/strt for the
> user-space accesses. That would then use the hardware for permission
> checking and eliminate the most expensive part of that code (doing the tlb
> check manually).
I've been remembering stuff from the SA1100 book, and just looked at
the arm26 and arm32 routines. I suspect the arm32 version is necessary!
(However its copy loop could be improved.)
IIRC the SA1100 will access data in its cache without looking at the
TLB/PTE entry. Now unless the permissions from the TLB are saved
with the cache line (Richard might be able to find out) this would
allow users to write to kernel data that is in the d-cache! 
However valid user addresses are known to be a bounded range (8k to
maybe 0xc0000000 - or similar) so this can be checked quite cheaply,
allowing the kernel permissions to be used for the copy.
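
As a rough illustration of how cheap such a bounds check can be, here is a
minimal C sketch. The names USER_ADDR_MIN, USER_ADDR_MAX and user_range_ok
are made up for this example, and the two limits are just the approximate
figures quoted above (8k and 0xc0000000), not values taken from any real
port. Addresses are modelled as 32-bit integers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed limits, copied from the rough figures in the text above.
 * Real ports define their own constants; these names are hypothetical. */
#define USER_ADDR_MIN UINT32_C(0x00002000)  /* 8k, lowest valid user address */
#define USER_ADDR_MAX UINT32_C(0xc0000000)  /* first address past user space */

/* Return true if [uaddr, uaddr + len) lies entirely inside the assumed
 * user address range. */
static bool user_range_ok(uint32_t uaddr, uint32_t len)
{
    if (uaddr < USER_ADDR_MIN || uaddr > USER_ADDR_MAX)
        return false;
    /* Compare len against the space remaining below the upper limit,
     * so that uaddr + len is never computed and cannot wrap around. */
    return len <= USER_ADDR_MAX - uaddr;
}
```

That is two compares on the start address plus a subtract and compare on the
length, which should be far cheaper than walking the page tables by hand. It
only shows that the range is a user range; it says nothing about whether the
pages are mapped or writable.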
The PTE check is actually checking for the 'copy on write' case,
not the 'page not present' case. This may have something to do
with the problems Jason? was having with COW on XSCALE.
My guess is that the cpu wasn't faulting the write to cache! Just the
writeback of the cacheline - which would be asynchronous! 
The 'red tape' of these routines could be improved somewhat:
1) Save regs at start of copyin and copyout, restore at end of
2) Don't restore PCB_ONFAULT half way through copyout
3) use ldm/stm for Lcopyinoutloop
4) maybe assume VM_MIN/MAX_USER/KERN_ADDRESS are valid constants
(or grab them with a ldm)
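
To show the shape of point 3, here is a C sketch of a copy loop that moves
four words per iteration instead of one. It is only an analogy: the real
routine would be written in assembler with ldm/stm, and a C compiler may or
may not turn this into multiple-register loads and stores. The function name
copy_words is made up, and both pointers are assumed to be word aligned and
non-overlapping.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy nwords 32-bit words from src to dst.
 * Assumes both pointers are 32-bit aligned and the buffers do not overlap. */
static void copy_words(uint32_t *dst, const uint32_t *src, size_t nwords)
{
    /* Bulk loop: load four words, then store four words, which is the
     * pattern a four-register ldm/stm pair would perform. */
    while (nwords >= 4) {
        uint32_t a = src[0], b = src[1], c = src[2], d = src[3];
        dst[0] = a;
        dst[1] = b;
        dst[2] = c;
        dst[3] = d;
        src += 4;
        dst += 4;
        nwords -= 4;
    }
    /* Tail: copy the remaining zero to three words one at a time. */
    while (nwords > 0) {
        *dst++ = *src++;
        nwords--;
    }
}
```

The bulk loop cuts the loop overhead (compare, branch, pointer update) to a
quarter of the word-at-a-time version. Byte-sized heads and tails for
unaligned buffers are left out of the sketch.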
 If the permissions aren't kept in the cache line, the d-cache
must be flushed of all kernel addresses on EVERY return to
 It might be possible to read the cached data (ie the modified
version), invalidate the line, copy the page, write the
saved modified data to the copy.
Of course the fault will be happening well after the copy!
So detecting it (and erroring the user call) might be tricky!
David Laight: firstname.lastname@example.org