Subject: Re: Kernel copyin/out optimizations for ARM...
To: John Clark <j1clark@ucsd.edu>
From: Richard Earnshaw <rearnsha@arm.com>
List: port-arm
Date: 03/14/2002 11:06:26
>
> On Wednesday, 13 March 2002, at 17:27, Richard Earnshaw wrote:
>
> >
> > Nah! far more likely that the original author didn't know about ldrt!
> > Otherwise, the absence of a comment explaining why they aren't used is
> > unforgivable.
>
> I wasn't the urschreiber of the copy code, I did start this thread
> though...
You did know that once you've spawned a thread, it takes on a life of its
own ;-)
> and what I was wondering was not so much the lines that lead up to the
> copy loop, which need to test via the mmu, and of course if there are
> instructions which permit that easily, heck use them... but my question
> was: once the range has been found to be valid for copy in/out, why is
> there only one optimization for the case of 32 bit aligned transfers,
> and not a 'cache line' worth, by using the ldm/stm type of instructions,
> or the like.
Sure, but my suspicions are that that isn't where most of the time is
being lost. NetBSD/arm's PTE tables are not cached, so any accesses to
them are VERRRRRYYYYY slow, particularly from a StrongARM or similarly
sophisticated chip.
> I checked out the i386 code, and in some cases a floating point load
> is used to grab some number of 64 bit 'words' into the FPU, then
> spews the collected data back out. On other occasions there is use
> of the 'rep string' operation.
>
> Another idea I've seen on this was in ancient Mac OS, where a number of
> 'move with auto increment' instructions were lined up, with a test for
> 'how much data to move' and a jump to an indexed location in the move
> instructions, so that when the instructions were complete all the data
> would have been transferred.
> (Look ma, no decrement, test, jump, just transfers...). Although this
> would break cache for some 'large' number of transfers...
>
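[Editorial sketch of the 'jump into a run of moves' idea quoted above. This is not the Mac OS code and not anything from the NetBSD tree; the function name copy_words_upto8, the 8-word limit, and the return convention are all assumptions made for illustration. In C the closest equivalent is a switch whose cases fall through, which a compiler may (or may not) turn into a computed jump.]

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_UNROLLED_WORDS 8   /* assumption: size of the unrolled run */

/*
 * Hypothetical helper: copy nwords 32-bit words (0..8) by jumping into a
 * run of unrolled post-increment moves. There is no loop counter and no
 * per-word test; the only decision is where to enter the run.
 * Returns 0 on success, -1 if nwords exceeds the unrolled run (the caller
 * would have to split larger copies).
 */
static int copy_words_upto8(uint32_t *dst, const uint32_t *src, size_t nwords)
{
    if (nwords > MAX_UNROLLED_WORDS)
        return -1;

    switch (nwords) {
    case 8: *dst++ = *src++; /* fall through */
    case 7: *dst++ = *src++; /* fall through */
    case 6: *dst++ = *src++; /* fall through */
    case 5: *dst++ = *src++; /* fall through */
    case 4: *dst++ = *src++; /* fall through */
    case 3: *dst++ = *src++; /* fall through */
    case 2: *dst++ = *src++; /* fall through */
    case 1: *dst++ = *src++; /* fall through */
    case 0: break;
    }
    return 0;
}
```

[Entering at case N executes exactly N moves. The cost noted in the quoted text still applies: a longer run of moves takes more instruction-cache space.]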
We don't want to start using the FPU for things like that (even if we have
one, which is rarely the case on existing ARM chips). We don't want the
kernel to touch the FPU registers, except for context switch handling.

Unless the areas being copied are not cacheable (in which case things will
always be dog slow, regardless of what we do elsewhere), the read parts of
the copy loops will be entirely dominated by the cache-line misses.
Writes are slightly different, particularly on caches that don't
allocate-on-write; it's true that in this case a store multiple will be
beneficial, since then the writes can be put into the write buffer
together (though IIRC, the SA will merge writes to a cacheable area
anyway).
R.
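
[Editorial sketch of the 'cache line at a time' copy discussed in this thread, written in C rather than ARM assembly. It is not the NetBSD copyin/copyout code. The name copy_aligned_words and the 8-word (32-byte) line size are assumptions; the real line size depends on the CPU. Grouping the loads ahead of the stores mirrors an ldm followed by an stm, but whether a compiler actually emits those instructions is not guaranteed.]

```c
#include <stddef.h>
#include <stdint.h>

#define LINE_WORDS 8   /* assumption: 32-byte cache line, 4-byte words */

/*
 * Hypothetical helper: copy nwords 32-bit words between two buffers that
 * are both 32-bit aligned and do not overlap. Whole lines are moved as a
 * group of loads followed by a group of stores, so that the stores reach
 * the write buffer back to back; any remainder is copied word by word.
 */
static void copy_aligned_words(uint32_t *dst, const uint32_t *src, size_t nwords)
{
    while (nwords >= LINE_WORDS) {
        /* Load one line's worth into locals (analogous to a single ldm). */
        uint32_t w0 = src[0], w1 = src[1], w2 = src[2], w3 = src[3];
        uint32_t w4 = src[4], w5 = src[5], w6 = src[6], w7 = src[7];

        /* Store them together (analogous to a single stm). */
        dst[0] = w0; dst[1] = w1; dst[2] = w2; dst[3] = w3;
        dst[4] = w4; dst[5] = w5; dst[6] = w6; dst[7] = w7;

        src += LINE_WORDS;
        dst += LINE_WORDS;
        nwords -= LINE_WORDS;
    }

    /* Tail: fewer than LINE_WORDS words remain. */
    while (nwords > 0) {
        *dst++ = *src++;
        nwords--;
    }
}
```

[As the message above argues, the read side of such a loop is still dominated by cache-line misses, so any gain is expected mainly on the store side.]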