Subject: Re: Kernel copyin/out optimizations for ARM...
To: None <Richard.Earnshaw@buzzard.freeserve.co.uk>
From: John Clark <j1clark@ucsd.edu>
List: port-arm
Date: 03/13/2002 19:32:50
On Wednesday, 13 March 2002, at 17:27, Richard Earnshaw wrote:

>
> Nah! far more likely that the original author didn't know about ldrt!
> Otherwise, the absence of a comment explaining why they aren't used is
> unforgivable.

I wasn't the original author of the copy code, though I did start this
thread. What I was wondering about was not so much the lines that lead
up to the copy loop, which need to test via the MMU (and of course, if
there are instructions which permit that easily, heck, use them). My
question was: once the range has been found to be valid for
copyin/copyout, why is there only one optimization, for the case of
32-bit aligned transfers, and not one that moves a 'cache line' worth
at a time, by using the ldm/stm type of instructions or the like?
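
To make the question concrete, here is a rough C sketch of the kind of
loop I mean. It is not the NetBSD copy code: the function name
copy_words_by_line is made up, the 32-byte (eight word) line size is an
assumption, and whether a compiler actually turns the eight loads and
eight stores into ldm/stm is up to the compiler. Fault handling for
user addresses is left out entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not the kernel's copyin/copyout.
 * Copies nwords 32-bit words between word-aligned buffers, moving
 * eight words (32 bytes, an assumed cache-line size) per iteration. */
void copy_words_by_line(uint32_t *dst, const uint32_t *src, size_t nwords)
{
    assert(dst != NULL && src != NULL);

    /* Bulk phase: load eight words, then store eight words.  On ARM a
     * compiler may emit ldm/stm for this, but that is not guaranteed. */
    while (nwords >= 8) {
        uint32_t w0 = src[0], w1 = src[1], w2 = src[2], w3 = src[3];
        uint32_t w4 = src[4], w5 = src[5], w6 = src[6], w7 = src[7];

        dst[0] = w0; dst[1] = w1; dst[2] = w2; dst[3] = w3;
        dst[4] = w4; dst[5] = w5; dst[6] = w6; dst[7] = w7;

        src += 8;
        dst += 8;
        nwords -= 8;
    }

    /* Tail: fewer than eight words left, copy them one at a time. */
    for (; nwords > 0; nwords--)
        *dst++ = *src++;
}
```

The existing word-at-a-time path corresponds to the tail loop alone; the
bulk phase is the part I am asking about.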

I checked out the i386 code, and in some cases a floating point load
is used to grab some number of 64-bit 'words' into the FPU, and then
the collected data is spewed back out. On other occasions there is use
of the 'rep string' operation.

Another idea I've seen on this was in ancient Mac OS, where a number of
'move with auto increment' instructions were lined up, with a test for
'how much data to move' and a jump to an indexed location in the move
instructions, so that when the instructions were complete all the data
would be transferred. (Look ma, no decrement, test, jump, just
transfers...) Although this would break cache for some 'large' number
of transfers...
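
A rough C analogue of that jump-into-the-moves trick, for what it's
worth. This is only a sketch: copy_words_jump_in and MAX_UNROLLED_WORDS
are names I made up, the limit of eight moves is arbitrary, and the
switch with fall-through stands in for the indexed jump (it is a close
relative of what C programmers call Duff's device, minus the loop).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Arbitrary illustration limit: how many moves are lined up. */
#define MAX_UNROLLED_WORDS 8

/* Hypothetical helper: copies nwords (0..8) 32-bit words by jumping
 * into a straight run of post-increment moves.  Once the jump is
 * taken there is no counter, no test and no branch, just transfers. */
void copy_words_jump_in(uint32_t *dst, const uint32_t *src, size_t nwords)
{
    assert(nwords <= MAX_UNROLLED_WORDS);

    switch (nwords) {
    case 8: *dst++ = *src++; /* fall through */
    case 7: *dst++ = *src++; /* fall through */
    case 6: *dst++ = *src++; /* fall through */
    case 5: *dst++ = *src++; /* fall through */
    case 4: *dst++ = *src++; /* fall through */
    case 3: *dst++ = *src++; /* fall through */
    case 2: *dst++ = *src++; /* fall through */
    case 1: *dst++ = *src++; /* fall through */
    case 0: break;
    }
}
```

Entering at case 3 runs exactly three moves. Lining up enough moves to
cover a large transfer is where the code size, and so the cache
footprint, starts to hurt.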