Subject: Re: Kernel copyin/out optimizations for ARM...
To: None <Richard.Earnshaw@arm.com>
From: John Clark <j1clark@ucsd.edu>
List: port-arm
Date: 03/14/2002 07:10:56
On Thursday, 14 March 2002, at 03:06, Richard Earnshaw wrote:

>
> We don't want to start using the FPU for things like that (even if we
> have one, which is rarely the case on existing ARM chips).  We don't want
> the kernel to touch the FPU registers, except for context switch handling.

I wasn't suggesting that such a method would be either usable or
desirable in the ARM case. I just cited it as an example of what someone
has done to speed things up by taking advantage of a feature of an
architecture, in that case the i386. On the old Motorola 68K, lining up
256 move instructions was 'faster', and so on.

In the case of the ARM, for 'fast' transfers it seems that the use of
multiple loads per instruction has some potential. As I read the current
arm/arm32/bcopyinout.S, there's only a test for 32-bit alignment, with a
byte copy otherwise, rather than tests for 64- or 128-bit alignment and
use of the multiple load ops (or so I recall from reading it a few days
ago...).
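
To make the alignment idea concrete, here is a rough sketch in C rather
than ARM assembly. It is not the code in bcopyinout.S; the names
common_alignment and tiered_copy are made up for illustration. It picks
the widest block (16, 8 or 4 bytes) that both pointers are aligned to and
copies in blocks of that size before falling back to bytes. Whether a
compiler turns the fixed-size block copies into multiple-register
load/store instructions, and whether those are actually faster on a
given ARM core, is exactly the open question.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: widest block size (16, 8, 4 or 1 bytes) to which
 * both the source and destination pointers are aligned. */
static size_t common_alignment(const void *src, const void *dst)
{
    uintptr_t bits = (uintptr_t)src | (uintptr_t)dst;

    if ((bits & 15) == 0) return 16;   /* 128-bit aligned */
    if ((bits & 7) == 0)  return 8;    /* 64-bit aligned */
    if ((bits & 3) == 0)  return 4;    /* 32-bit aligned */
    return 1;                          /* byte copy only */
}

/* Hypothetical sketch: copy in the widest blocks the alignment allows,
 * then finish with narrower blocks and finally single bytes. */
static void tiered_copy(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t align = common_alignment(s, d);

    if (align >= 16)
        for (; len >= 16; len -= 16, d += 16, s += 16)
            memcpy(d, s, 16);          /* fixed size: four 32-bit words */
    if (align >= 8)
        for (; len >= 8; len -= 8, d += 8, s += 8)
            memcpy(d, s, 8);           /* fixed size: two 32-bit words */
    if (align >= 4)
        for (; len >= 4; len -= 4, d += 4, s += 4)
            memcpy(d, s, 4);           /* fixed size: one 32-bit word */
    while (len-- > 0)
        *d++ = *s++;                   /* unaligned case or leftover tail */
}
```

A real copyin/copyout would also have to deal with faults on the user
address and with source and destination that are misaligned relative to
each other, both of which this sketch ignores.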

Now if someone who is really familiar with the details of ARM
implementations states that such multiple load ops are really dogger
than dog slow, and the only truly fast ones are 32-bit load/store
operations, I'll start looking at the DMA engine on the XScale companion
chip...