Subject: Re: shocking speed performance!
To: None <port-arm32@netbsd.org>
From: Peter Teichmann <teich-p@Rcs1.urz.tu-dresden.de>
List: port-arm32
Date: 05/20/1999 18:37:46
In message <81F5585A3B93D111A8D10080ADB4CBB90CA285@DC>
          Bruce Martin <BruceM@cat.co.za> wrote:

> We've just written a jpeg compression algorithm that takes a
> raw (RGB) image file and converts it to a jpeg. When we run
> this on a Pentium II, it takes 103ms, on a Pentium120 S it
> takes 340 ms, and on the EBSA-285 board with a Strongarm
> 233MHz, it takes 650ms!
> 
> Can anyone explain to me why the Strongarm takes so much
> longer than the Pentium to run this code? Surely it should
> run in the same range, if not faster. By the way, the
> Pentium II 266 version was only compiled with the -m486
> option, so no special PII optimization was used...

Well, I have run into similar "problems". There are several possible reasons
for this (though so far I only have a RiscPC with a Strongarm):

1. The Pentium has a superscalar architecture with 2 units, so under
   certain circumstances (suitable algorithm, good optimizing compiler)
   a Pentium 120 MHz can be as fast as a Strongarm 100MHz.
   
2. gcc for x86 is certainly much better optimized than gcc for arm32, as
   many more people are working on it.
   
3. The Pentium has very few registers, but it can do arithmetic directly
   on memory operands, and the cost of that is mostly hidden by the long
   pipeline, so that there is almost no penalty. I think this is a very
   important point.

4. The Strongarm has no L2 cache, so non-contiguous memory accesses that
   miss the internal cache are slower than on a Pentium. Remember the
   d-cache architecture (16K, 32-way associative, lines are 8 words long).
   This often affects algorithms that work on larger data sets.
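
To make the cache figures in (4) concrete, here is a small sketch in C that
works out the geometry. It assumes 4-byte words and that the set index is
taken from the address bits directly above the line offset; that is the usual
arrangement, but it has not been checked against the Strongarm manual here.
The names cache_set and accesses_in_first_set are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* D-cache figures from point 4: 16K, 32-way, lines of 8 words. */
enum {
    CACHE_BYTES = 16 * 1024,
    WAYS        = 32,
    LINE_BYTES  = 8 * 4,                             /* assumes 4-byte words */
    SETS        = CACHE_BYTES / (WAYS * LINE_BYTES)  /* 16 sets */
};

/* Assumption: the set index is the line number modulo the number of sets. */
static unsigned cache_set(uintptr_t addr)
{
    return (unsigned)((addr / LINE_BYTES) % SETS);
}

/* Count how many of n accesses, spaced stride bytes apart and starting at
 * address 0, fall into the same set as the first access. */
static size_t accesses_in_first_set(size_t stride, size_t n)
{
    size_t hits = 0;
    for (size_t i = 0; i < n; i++)
        if (cache_set((uintptr_t)(i * stride)) == cache_set(0))
            hits++;
    return hits;
}
```

Under these assumptions, addresses that are SETS * LINE_BYTES = 512 bytes
apart share one set. A loop that strides through memory in steps of 512 bytes
therefore starts evicting lines once it has touched more than 32 of them,
even though it has used only about 1K of the 16K cache.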
   
I have found that compilers often do a poor job on Strongarms; hand-optimized
assembler code is often twice as fast or more. You also often need to change
the data structures to work around the problems resulting from (4). Sometimes
it also helps to change the algorithm slightly to get better performance.
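
As an illustration of the kind of data structure change meant here, the
following C sketch compares two layouts for RGB pixel data. It is not taken
from the JPEG code under discussion; the types rgb and planes and both
functions are hypothetical. A pass that needs only one channel pulls whole
pixels through the cache with the interleaved layout, whereas the planar
layout keeps that channel contiguous.

```c
#include <stddef.h>
#include <stdint.h>

/* Interleaved layout: the R, G and B bytes of a pixel sit next to each other. */
struct rgb { uint8_t r, g, b; };

/* Planar layout: one contiguous array per channel. */
struct planes { const uint8_t *r, *g, *b; };

/* Sum the red channel of n interleaved pixels. Every cache line that is
 * fetched also carries green and blue bytes this loop never reads. */
static unsigned long sum_red_interleaved(const struct rgb *px, size_t n)
{
    unsigned long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += px[i].r;
    return sum;
}

/* Same sum over the planar layout. The loop walks one contiguous array,
 * so every byte of each fetched line is red data. */
static unsigned long sum_red_planar(const struct planes *p, size_t n)
{
    unsigned long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += p->r[i];
    return sum;
}
```

Both functions return the same result; only the memory access pattern
differs. Whether the planar form is actually faster depends on the rest of
the algorithm, so it is worth measuring on the target machine.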

On x86 I did not have much success with hand-optimized code; it was never
much faster.

-- 
Peter Teichmann

----------------------------------------------------------------------------
Email: teich-p@rcs.urz.tu-dresden.de  WWW: rcswww.urz.tu-dresden.de/~teich-p