Subject: Re: Port of NetBSD to XScale
To: Charles M. Hannum <root@ihack.net>
From: Chris Gilbert <chris@paradox.demon.co.uk>
List: port-arm32
Date: 03/29/2001 22:57:38
On Thursday 29 March 2001  9:06 pm, Ignatios Souvatzis wrote:
> On Thu, Mar 29, 2001 at 11:45:39AM -0800, Charles M. Hannum wrote:
> > On Thu, Mar 29, 2001 at 11:42:40AM -0800, Charles M. Hannum wrote:
> > > On Thu, Mar 29, 2001 at 11:08:13AM +0100, Richard Earnshaw wrote:
> > > > According to the documentation I have, Xscale only predicts B and BL
> > > > instructions, both of which only have pc-relative invariant offsets. 
> > > > Any mis-predicted (or unpredicted) branch takes at least 5 cycles to
> > > > issue (8 if the value has to come from memory). [XScale Developers
> > > > Manual, Table 14-4]

I've not got a table 14-4 in mine, which is from:
http://developer.intel.com/design/intelxscale/273473.htm

> > > So a function return always takes 5 clock cycles??  Was this thing
> > > developed by the same group that did the P4, perchance??

Possibly it's got a branch prediction buffer (I'm sure that's i386-like).  Oh, 
and the "Intel (r) XScale (tm) Core RISC Superpipeline" branches: there's a 
main execution pipeline of 7 stages, a MAC (multiply-accumulate) pipeline 
sharing 4 stages with the main pipeline plus 3 of its own (the last of which 
can loop), and a data pipeline sharing 5 stages with the main pipeline plus 3 
of its own.  So the pipeline now has 13 distinct stages in all.  The SA only 
had 5, in one straight pipe, AFAIR.

> > Shit, and another 5 cycles for every PIC function call.  This is gonna
> > suck a lot.
>
> Maybe they want to make sure it is only used for embedded applications, not
> for general OSes.

Well, it is clocked a lot faster, so in theory it might come in at around the 
same speed as the SA, provided the memory can keep up (but then it has got 
32k of instruction cache and 32k of data cache, which you can do extra things 
with).

Oh, there's also some fun with flushing the data cache out: if you thought the 
SA was fun, this one's bigger, you have to use an MMU coprocessor call, and 
THEN you flush out something called the mini-data cache!
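To make that concrete, the clean sequence I have in mind looks something like 
the following.  This is a rough sketch from my reading of the manual, not 
tested code: the line-allocate opcode and the trick of cleaning by allocating 
through a spare, otherwise-unused 32k region are my understanding of how it's 
meant to work, and the names (xscale_dcache_clean, scratch) are made up, not 
anything that exists in the tree.

```c
#include <stdint.h>

#define DCACHE_SIZE	(32 * 1024)	/* main data cache */
#define LINE_SIZE	32		/* bytes per cache line */

/*
 * Sketch: clean the XScale main data cache.  CP15 c7 has an
 * "allocate line" operation, so allocating your way through a
 * cacheable 32k region that's reserved for the purpose evicts
 * (and writes back) every line that was resident.  "scratch" is
 * a hypothetical virtual address set aside for this.
 */
static void
xscale_dcache_clean(uintptr_t scratch)
{
	uintptr_t va;

	for (va = scratch; va < scratch + DCACHE_SIZE; va += LINE_SIZE)
		__asm volatile("mcr p15, 0, %0, c7, c2, 5" :: "r" (va));

	/* drain the write buffer so the dirty lines really reach memory */
	__asm volatile("mcr p15, 0, %0, c7, c10, 4" :: "r" (0));

	/*
	 * ...and the little 2k mini-data cache still needs its own
	 * pass, through a region mapped with the mini-cache attributes.
	 */
}
```

As far as I can tell the old SA trick of reading through a reserved area is 
still what you do for the mini-data cache; the allocate op above is the new 
part.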

I wonder how it performs on real code (with shared libs, etc.), not just 
benchmarks.  Perhaps it really is just an embedded chip?

Or maybe I'm just too tired to think rationally (much more likely, 
considering my branch with a register this morning :)

Cheers,
Chris