port-arm32: Re: Cacheing parts of podule space

Subject: Re: Cacheing parts of podule space
To: Reinoud Zandijk <zandijk@cs.utwente.nl>
From: Richard Earnshaw <rearnsha@arm.com>
List: port-arm32
Date: 09/09/1999 11:22:19

zandijk@cs.utwente.nl said:
> >  Now, instead of 
> > the truly sucking 100K/s or thereabouts that I used to get, I now get 
> > about 900K/s with much lower load on the machine.
>
> Great! I haven't got an AKA-31 SCSI card, but it's always great to see
> such a performance boost! 

But it's not as good as it could be -- the internal IDE card gets about 
1.3M/s with PIO.  The spec for the SCSI card says that it should be 
possible to transfer data to & from the buffer memory at up to 6M/s 
(though I wouldn't expect to see that much from the card in real use; in 
fact, if I could get half that I would be very happy, but that's still a 
factor of 3 away).

> Yep it is. As far as I know, there are 3 different cycle types for each
> podule in podule-space which are selected by using an offset in its podule
> space (in RiscOS): slow, medium, fast. Have you checked them? Are you
> using EASI, i.e. 32-bit bus transfers? Does the card support that?
> Another ``trick'' is to use the pipeline of the StrongARM, i.e. process
> information while retrieving: interleave loading the next 8/16-bit
> word with packing the bytes already retrieved to 32 bits. You get the
> extra cycles for free as long as you avoid register-locking, and thus it
> can be faster compared to ldm/stm and then packing afterwards.

The AKA-31 (there are also AKA-30 & AKA-32 boards, which seem to be 
substantially the same spec) is an old (v. old) board that was designed 
for the Acorn RISC iX machines; as such it conforms to the old podule 
specs, i.e. no EASI, 16-bit transfers, no DMA to podule space.  I'm only 
using it 'cos I happen to have one lying around; I wouldn't recommend 
anyone to go out and buy one... even if they could.

> 
> Dunno much about the RiscPC's hardware by heart, but is it possible to use
> a MEMC DMA channel to get the data from the card? Would/can it be faster?

No, the board doesn't support it.  See above.

> 
> > So, to the question.  Is there a way to map just one page of podule-space 
> > (the page where the buffer memory is mapped) to be cached/bufferable?  I 
> > really think that on a strongarm this will be a sufficient win to make 
> > syncing the cache during such transfers a price worth paying.
> 
> Dunno the internals that well, but I guess it'll be possible... can't see
> why it couldn't be done. But are the bus-space routines using ldm/stm?
> Could they be optimized? Or is the gain negligible...

The code that actually does the transfer to the buffer memory is already 
highly optimized assembler.  Because of the differences in speeds 
(processor running at 233 MHz, podule bus at 8 MHz) each podule cycle is the 
equivalent of nearly 30 internal cycles which is enough time to prepare 
about 20 registers if I had that many -- I can marshal 2 registers in 3 
cycles.  So in a single podule cycle there's enough time to prepare 
several registers for writing to podule space (or to unmarshal several 
reads).  What is needed is a way to eliminate the synchronization 
overheads from the way the StrongARM splits its bus transactions up when 
the memory is not buffered.
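
As a rough check of the arithmetic above, here is a small sketch that uses 
only the figures already quoted (233 MHz core, 8 MHz podule bus, 2 
registers marshalled per 3 core cycles).  The variable names are made up 
for illustration, and this is back-of-envelope arithmetic, not a 
measurement:

```python
# Back-of-envelope check of the clock-ratio figures quoted above.
# All names here are illustrative; only the numbers come from the text.
CORE_MHZ = 233.0     # StrongARM core clock
PODULE_MHZ = 8.0     # podule bus clock

# Core cycles that elapse during one podule bus cycle.
core_cycles_per_podule_cycle = CORE_MHZ / PODULE_MHZ

# Marshalling rate quoted above: 2 registers every 3 core cycles.
REGS_PER_CORE_CYCLE = 2.0 / 3.0

# Registers that could be prepared in the time of one podule cycle.
regs_per_podule_cycle = core_cycles_per_podule_cycle * REGS_PER_CORE_CYCLE

print(f"{core_cycles_per_podule_cycle:.1f} core cycles per podule cycle")
print(f"{regs_per_podule_cycle:.1f} registers marshalled per podule cycle")
```

That comes out at a little over 29 core cycles and a little over 19 
registers per podule cycle, which is where the "nearly 30" and "about 20" 
come from.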

On traditional ARMs (eg ARM6, ARM7) a STM r0, {r1-r4} would generate the 
following bus traffic:

	N-cycle
	S-cycle
	S-cycle
	S-cycle

and in effect does not release the bus during this time.

On StrongARM, each ldm/stm is expanded internally in the pipeline to a 
series of ldr/str instructions; these are then recombined in the cache so 
that reads and writes are streamed.  But when the cache is off, it comes 
out as

	N-cycle
	N-cycle
	N-cycle
	N-cycle

N-cycles often take twice as long as S-cycles, and there is probably an 
additional cycle of synchronization when the main bus keeps 
re-synchronizing to the podule bus.  Except for the first word on each 
cache line, the overhead cycles would all be eliminated if the memory were 
cached.
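
To put a rough number on the difference, here is a toy cost model of the 
two traffic patterns for a 4-word store.  It assumes an N-cycle costs 
exactly twice an S-cycle and that each separate transaction pays exactly 
one extra re-synchronization cycle; both are guesses taken from the 
"often" and "probably" above, not measured values, and the function names 
are invented for this sketch.  It also ignores any re-synchronization cost 
on the burst itself:

```python
# Toy model of bus cost for a multi-word store to podule space.
# Units are arbitrary "S-cycle" units; the constants are assumptions.
S_CYCLE = 1               # cost of one sequential cycle
N_CYCLE = 2 * S_CYCLE     # assumption: an N-cycle takes twice as long
RESYNC = 1                # assumption: one extra cycle per separate transaction


def burst_cost(words: int) -> int:
    """One N-cycle followed by S-cycles (traditional ARM6/ARM7 stm)."""
    return N_CYCLE + (words - 1) * S_CYCLE


def split_cost(words: int) -> int:
    """Every word is its own N-cycle plus a re-sync (uncached StrongARM)."""
    return words * (N_CYCLE + RESYNC)


words = 4  # matches the 4-register stm example above
print("burst:", burst_cost(words), "split:", split_cost(words))
```

Under those assumptions the split form costs more than twice the burst 
form for four words; the real ratio depends on the actual cycle timings.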

Richard.