Subject: Re: Xscale optimisations
To: David Laight <david@l8s.co.uk>
From: Steve Woodford <scw@wasabisystems.com>
List: port-arm
Date: 10/14/2003 13:47:20

On Tuesday 14 October 2003 12:28 pm, David Laight wrote:
> > 	- significant improvements to some mem*() library functions,
>
> Are those a real improvement?
> In particular when the code isn't in the I$ ?

I've benchmarked various combinations of micro-optimisations on the 
Xscale, and what you see in the current code is what gave the best 
all-round results.

> Other experiments have shown that they are very often called
> with short transfer lengths, and that the cost of deciding which
> algorithm to use can become dominant.

Yup. The short/misaligned memcpy code is borderline, but in network 
throughput tests, it gives a slight improvement.
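
To illustrate the trade-off being discussed, here is a rough C sketch of a
copy routine that decides up front between a short path and an aligned
word-at-a-time path. It is not the code in the tree: the name
sketch_memcpy and the 16-byte SHORT_COPY_MAX threshold are made up for the
example, and the real routines are hand-written assembly. The point is only
that short calls should pay for a single compare before they start copying.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical threshold: at or below this, skip all alignment analysis. */
#define SHORT_COPY_MAX 16

/* Illustrative length-dispatching copy; not the NetBSD/arm implementation. */
static void *sketch_memcpy(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Short path: one compare, then a plain byte loop. */
    if (len <= SHORT_COPY_MAX) {
        while (len--)
            *d++ = *s++;
        return dst;
    }

    /* Long path: copy bytes until the destination is word aligned.
     * len > SHORT_COPY_MAX here, so at most 3 bytes cannot underflow len. */
    while (((uintptr_t)d & (sizeof(uint32_t) - 1)) != 0) {
        *d++ = *s++;
        len--;
    }

    /* Move one word at a time. The fixed-size memcpy() calls keep a
     * misaligned source safe and are normally compiled to plain loads
     * and stores. */
    while (len >= sizeof(uint32_t)) {
        uint32_t w;
        memcpy(&w, s, sizeof w);
        memcpy(d, &w, sizeof w);
        d += sizeof w;
        s += sizeof w;
        len -= sizeof w;
    }

    /* Trailing bytes. */
    while (len--)
        *d++ = *s++;

    return dst;
}
```

Where the threshold should sit, and whether the misaligned-source case
deserves its own path, is the sort of thing that only benchmarking on the
target CPU can settle.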

> Also, IIRC, the strongarm doesn't execute stmgeia quickly if the
> condition is false.  Having 16 in a row must be worth a branch?

My brief was to optimise for Xscale. If I've added non-optimal code for 
non-Xscale CPUs, that's probably because I wasn't as careful with that 
part of the code. Volunteers to fix it are more than welcome :)

> [using mini D$] ought to benefit SA1100/1110 (110?) systems as well.
>
> Does anyone know if the SA1100 ever generates a memory burst for a
> stmia write that misses the cache?

See above re. concentrating on Xscale. ;-)

Cheers, Steve

-- 

Wasabi Systems Inc. - The NetBSD Company - http://www.wasabisystems.com/