Subject: Re: Kernel copyin/out optimizations for ARM...
To: John Clark <j1clark@ucsd.edu>
From: Richard Earnshaw <rearnsha@arm.com>
List: port-arm
Date: 03/14/2002 11:06:26
> On Wednesday, 13 March 2002, at 17:27, Richard Earnshaw wrote:
>
> > Nah! far more likely that the original author didn't know about ldrt!
> > Otherwise, the absence of a comment explaining why they aren't used is
> > unforgivable.
>
> I wasn't the original author of the copy code, I did start this thread
> though...

You did know that once you've spawned a thread, it takes on a life of its
own ;-)

> and what I was wondering was not so much the lines that lead up to the
> copy loop, which need to test via the mmu, and of course if there are
> instructions which permit that easily, heck use them... but my question
> was once the range has been found to be valid for copyin/out, why is
> there only one optimization for the case of 32-bit aligned transfers,
> and not a 'cache line' worth, by using the ldm/stm type of instructions,
> or the like.

Sure, but my suspicion is that that isn't where most of the time is
being lost.  NetBSD/arm's PTE tables are not cached, so any accesses to
them are VERRRRRYYYYY slow, particularly from a StrongARM or similarly
sophisticated chip.

> I checked out the i386 code, and in some cases a floating point load
> is used to grab some number of 64 bit 'words' into the FPU, then
> spews the collected data back out. On other occasions there is use
> of the 'rep string' operation.
>
> Another idea I've seen on this was in ancient Mac OS where a number of
> 'move with auto increment' instructions were lined up, a test for 'how
> much data to move', and a jump to an indexed location in the move
> instructions, which when the instructions were complete all the data
> would be transferred.
> (Look ma, no decrement, test, jump, just transfers...).  Although this
> would break cache for some 'large' number of transfers...
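(For anyone who hasn't met it, that Mac OS trick is essentially what C
programmers call Duff's device: compute how many items the tail needs,
then jump into the middle of an unrolled run of copies so there's no
per-item decrement-and-test.  A rough sketch, purely illustrative and
not the actual Mac OS or NetBSD code:

```c
#include <stddef.h>

/* Duff's device: the switch jumps into the middle of an unrolled run
 * of copies, so a partial tail costs no per-byte decrement/test/jump. */
static void
duff_copy(char *dst, const char *src, size_t len)
{
    size_t n = (len + 7) / 8;       /* number of passes round the loop */

    if (len == 0)
        return;
    switch (len % 8) {              /* entry point for the tail */
    case 0: do { *dst++ = *src++;
    case 7:      *dst++ = *src++;
    case 6:      *dst++ = *src++;
    case 5:      *dst++ = *src++;
    case 4:      *dst++ = *src++;
    case 3:      *dst++ = *src++;
    case 2:      *dst++ = *src++;
    case 1:      *dst++ = *src++;
            } while (--n > 0);
    }
}
```

The case labels falling through the body of the do-while are legal C;
the jump lands at the right offset so exactly `len` bytes get copied.)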

We don't want to start using the FPU for things like that (even if we
have one, which is rarely the case on existing ARM chips).  We don't
want the kernel to touch the FPU registers, except for context switch
handling.

Unless the areas being copied are not cacheable (in which case things
will always be dog slow, regardless of what we do elsewhere), the read
parts of the copy loops will be entirely dominated by the cache-line
misses.  Writes are slightly different, particularly on caches that
don't allocate-on-write; it's true that in this case a store multiple
will be beneficial, since then the writes can be put into the write
buffer together (though IIRC, the SA will merge writes to a cacheable
area anyway).

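(The kind of inner loop being discussed might look roughly like this in
C; `copy_words` is a made-up name for illustration, not NetBSD's actual
copyin/copyout.  The point is that on ARM a compiler can turn each
eight-word block into an ldmia/stmia pair, so a whole 32-byte line's
worth of stores reaches the write buffer together:

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative sketch: copy 32-bit-aligned data eight words (one
 * 32-byte line) at a time.  On ARM each unrolled block can compile
 * down to one ldmia plus one stmia. */
static void
copy_words(uint32_t *dst, const uint32_t *src, size_t nwords)
{
    while (nwords >= 8) {           /* one line's worth per iteration */
        dst[0] = src[0]; dst[1] = src[1];
        dst[2] = src[2]; dst[3] = src[3];
        dst[4] = src[4]; dst[5] = src[5];
        dst[6] = src[6]; dst[7] = src[7];
        src += 8; dst += 8; nwords -= 8;
    }
    while (nwords--)                /* leftover words, one at a time */
        *dst++ = *src++;
}
```

Whether that wins anything over the simple word loop depends on the
cache and write-buffer behaviour described above.)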
R.