Subject: Re: Cacheing parts of podule space
To: Reinoud Zandijk <zandijk@cs.utwente.nl>
From: Richard Earnshaw <rearnsha@arm.com>
List: port-arm32
Date: 09/09/1999 11:22:19
zandijk@cs.utwente.nl said:
> > Now, instead of
> > the truly sucking 100K/s or thereabouts that I used to get, I now get
> > about 900K/s with much lower load on the machine.
>
> Great! I haven't got a AKA-31 SCSI card, but its allways great to see
> such a performance boost!
But it's not as good as it could be -- the internal IDE card gets about
1.3M/s with PIO. The spec for the SCSI card says that it should be
possible to transfer data to & from the buffer memory at up to 6M/s
(though I wouldn't expect to see that much from the card in real use; in
fact, if I could get half that I would be very happy, but that's still a
factor of 3 away).
> Yep it is. As far as I know, there are 3 different cycletypes for each
> podule in podule-space wich are selected by using an offset in it's podule
> space (in RiscOS) : slow, medium, fast. Have you checked them ? Are you
> using EASI ? i.e. 32 bit bus transfers? does the card support that?
> Another ``trick'' is to use the pipeline of the StrongARM i.e. process
> information while retrieving i.e. interleave loading the next 8/16 bit
> word while packing the bytes allready retrieved to 32 bits : you get the
> extra cycles for free as long as you avoid register-locking, and thus can
> be faster compared to ldm/stm and then packing afterwards.
The AKA-31 (there are also AKA-30 & AKA-32 boards, which seem to be
substantially the same spec) is an old (v. old) board that was designed
for the Acorn RISC iX machines; as such it conforms to the old podule
specs, i.e. no EASI, 16-bit transfers, no DMA to podule space. I'm only
using it 'cos I happen to have one lying around; I wouldn't recommend
anyone to go out and buy one... even if they could.
>
> Dunno much about the RiscPC's hardware by heart, but is it posible to use
> a MEMC DMA channel to get the data from the card? Would/can it be faster?
No, the board doesn't support it. See above.
>
> > So, to the question. Is there a way to map just one page of podule-space
> > (the page where the buffer memory is mapped) to be cached/bufferable? I
> > really think that on a strongarm this will be a sufficient win to make
> > syncing the cache during such transfers a price worth paying.
>
> Dunno the interiors that good, but I guess it'll be possible... can't see
> why it couldn't be done. But are the bus-space routines using ldm/stm?
> Could they be optimized? Or is the gain neglectible...
The code that actually does the transfer to the buffer memory is already
highly optimized assembler. Because of the difference in speeds
(processor running at 233 MHz, podule bus at 8 MHz), each podule cycle is
the equivalent of nearly 30 internal cycles, which is enough time to
prepare about 20 registers if I had that many -- I can marshal 2 registers
in 3 cycles. So in a single podule cycle there's enough time to prepare
several registers for writing to podule space (or to unmarshal several
reads). What is needed is a way to eliminate the synchronization
overheads from the way the StrongARM splits its bus transactions up when
the memory is not buffered.
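
As a rough check of the arithmetic above, here is a small sketch. The function name and parameters are made up for illustration and are not part of any driver; the only inputs are the figures quoted in this message (233 MHz core, 8 MHz podule bus, 2 registers marshalled per 3 core cycles).

```python
def registers_per_podule_cycle(cpu_mhz=233.0, podule_mhz=8.0,
                               regs_per_batch=2, cycles_per_batch=3):
    """Return (core cycles per podule cycle, registers marshalled in that time).

    Hypothetical helper: it only restates the ratio arithmetic from the text
    and ignores pipeline stalls, cache effects and bus synchronization.
    """
    core_cycles = cpu_mhz / podule_mhz
    regs = core_cycles * regs_per_batch / cycles_per_batch
    return core_cycles, regs


core_cycles, regs = registers_per_podule_cycle()
print(f"{core_cycles:.1f} core cycles per podule cycle, ~{regs:.0f} registers")
```

With the defaults this gives a little over 29 core cycles per podule cycle and roughly 19 registers, in line with the "nearly 30" and "about 20" above.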
On traditional ARMs (eg ARM6, ARM7) a STM r0, {r1-r4} would generate the
following bus traffic:
    N-cycle
    S-cycle
    S-cycle
    S-cycle
and in effect does not release the bus during this time.
On StrongARM, each ldm/stm is expanded internally in the pipeline to a
series of ldr/str instructions; these are then recombined in the cache so
that reads and writes are streamed. But when the cache is off, it comes
out as
    N-cycle
    N-cycle
    N-cycle
    N-cycle
N-cycles often take twice as long as S-cycles, and there is probably an
additional cycle of synchronization when the main bus keeps
re-synchronizing to the podule bus. Except for the first word on each
cache line, the overhead cycles would all be eliminated if the memory were
cached.
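
To put illustrative numbers on the two bus patterns, here is a toy cost model. Everything in it is an assumption for the sake of the comparison, not a measurement: an N-cycle is taken to cost 2 units and an S-cycle 1 unit ("often twice as long"), and the optional synchronization penalty is charged once per N-cycle. The function name is hypothetical.

```python
def burst_cost(words, n_cost=2, s_cost=1, sync_cost=0, sequential=True):
    """Cost, in arbitrary bus-clock units, of transferring `words` words.

    sequential=True  models one N-cycle followed by S-cycles
                     (the traditional ARM6/ARM7 STM pattern).
    sequential=False models every word as its own N-cycle
                     (the StrongARM pattern with the cache off).
    sync_cost is an assumed penalty added to each N-cycle.
    """
    if words <= 0:
        return 0
    if sequential:
        return (n_cost + sync_cost) + (words - 1) * s_cost
    return words * (n_cost + sync_cost)


for sync in (0, 1):
    old = burst_cost(4, sync_cost=sync)
    new = burst_cost(4, sync_cost=sync, sequential=False)
    print(f"sync={sync}: N+3S costs {old}, 4xN costs {new}")
```

Under these assumptions a four-word store costs 5 units as N+S+S+S against 8 units as four N-cycles, or 6 against 12 if each N-cycle also pays one unit of re-synchronization. The real ratio depends on the actual bus timings.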
Richard.