Subject: Re: improving ssh performance on sun4m systems
To: David Laight <david@l8s.co.uk>
From: Michael Thompson <m_thompson@ids.net>
List: port-sparc
Date: 03/14/2002 08:34:14
Take a look at the Astronautics ZS-1. 

http://ricm.museum.com/collections/astronautics/zs-1.html

This was the first decoupled access/execute architecture machine, and its
compilers unrolled loops to improve performance. The PowerPC and MIPS CPUs
use a similar architecture.

You are correct that this machine did not have a cache, but it did have
instruction pipelines. The decoupled access/execute (DAE) architecture
masks the slow memory access times by having the access processor run
asynchronously with the execute processor. The access processor tries to
keep the execute processor's pipeline full so that it is not limited by
memory access times.
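
As a rough illustration of the decoupling idea, here is a toy cycle-count
model in Python. It is only a sketch and not the actual ZS-1 design: the
function names, the queue depth, the latencies, and the assumption that
memory accepts one new load per cycle are all made up for the example.

```python
from collections import deque

def coupled_cycles(n_items, mem_latency, exec_cycles):
    """One processor: issue a load, wait for it, compute, then repeat."""
    return n_items * (mem_latency + exec_cycles)

def decoupled_cycles(n_items, mem_latency, exec_cycles, queue_depth):
    """Access processor runs ahead and feeds operands through a queue."""
    queue = deque()   # arrival cycle of each load issued but not yet consumed
    issued = 0        # loads started by the access processor
    done = 0          # items started by the execute processor
    busy_until = 0    # cycle at which the execute processor is free again
    cycle = 0
    while done < n_items:
        # Access processor: start at most one load per cycle if there is room.
        if issued < n_items and len(queue) < queue_depth:
            queue.append(cycle + mem_latency)
            issued += 1
        # Execute processor: take the oldest operand once it has arrived.
        if cycle >= busy_until and queue and queue[0] <= cycle:
            queue.popleft()
            busy_until = cycle + exec_cycles
            done += 1
        cycle += 1
    return busy_until

if __name__ == "__main__":
    # Hypothetical numbers: 100 items, 10-cycle memory, 2-cycle compute.
    print("coupled:  ", coupled_cycles(100, 10, 2))
    print("decoupled:", decoupled_cycles(100, 10, 2, queue_depth=8))
```

In this model the execute processor pays the memory latency only once, for
the first operand. After that the queue stays ahead of it, so the run time
is set by the compute time rather than by the memory access time.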

This machine benchmarked at 45 MIPS in 1988.


At 08:21 AM 3/14/02 +0000, David Laight wrote:
>On Thu, Mar 14, 2002 at 05:41:12AM +0300, Valeriy E. Ushakov wrote:
>> On Thu, Mar 14, 2002 at 13:29:52 +1100, matthew green wrote:
>> 
>> > interesting that "-O2" is better than "-O3" and
>> > "-O3 -fomit-frame-pointer".
>> 
>> gcc can sometimes make some weird decisions about inlining (that -O3
>> turns on).  Perhaps that is the case and you start to pay the price of
>> faulting in those extra pages?  I've seen -O3 blow up the file size 5
>> times because simple, very frequently used helper functions were
>> inlined all over the place.
>
>Yes - both inlining and unrolling loops can cause the code's active
>size to increase. This has 2 detrimental effects:
>1) more instructions must be read into the cache to execute a loop
>2) other code is displaced from the I-Cache.
>
>The net effect is that the benchmark runs faster, but any real
>workload runs slower!
>
>I suspect some of these optimisations were invented when systems
>didn't have non-trivial caches, and when memory speeds were
>similar to execution speed.
>
>	David
>
>-- 
>David Laight: david@l8s.co.uk
>
>

Michael Thompson
E-Mail: M_Thompson@IDS.net