Subject: Re: Cacheing parts of podule space
To: None <richard.earnshaw@arm.com>
From: Reinoud Zandijk <zandijk@cs.utwente.nl>
List: port-arm32
Date: 09/09/1999 11:40:57
Hi Richard,
On Thu, 9 Sep 1999, Richard Earnshaw wrote:
> I finally got fed up with the sucky performance of my Acorn AKA-31 SCSI
> card, so I've re-written the driver for it to make use of the on-board
> buffer memory that can be used for DMA to the SCSI bus. Now, instead of
> the truly sucking 100K/s or thereabouts that I used to get, I now get
> about 900K/s with much lower load on the machine.
Great! I haven't got an AKA-31 SCSI card, but it's always great to see such
a performance boost!
> However, I've hit a brick wall; the bottleneck now seems to be the time
> taken to transfer the data to the buffer memory. The podule-space is
> mapped uncached (sounds reasonable, you say), but on a StrongARM this
> means that ldm/stm transfers are not buffered or streamed, so the hardware
> in effect breaks out each load/store in the instruction into a separate
> bus transaction, which probably means that the throughput to the buffer
> memory is divided by at least 2 and probably 3 (I forget the details).
> Ouch! Further, these cycles are all running at the podule bus speed,
> again I forget the numbers, but that's something like 8MHz.
Yep, it is. As far as I know, there are 3 different cycle types for each
podule in podule space, which are selected by using an offset into its podule
space (in RiscOS): slow, medium and fast. Have you checked them? Are you
using EASI, i.e. 32-bit bus transfers? Does the card support that?
Another ``trick'' is to use the pipeline of the StrongARM, i.e. process
data while retrieving it: interleave loading the next 8/16-bit word with
packing the bytes already retrieved into 32 bits. You get the extra cycles
for free as long as you avoid register interlocks, so this can be faster
than using ldm/stm and packing afterwards.
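
A minimal sketch of that interleaving in C, assuming the card presents its
data as 16-bit words with the low halfword first. The function name
pack16_interleaved and that data layout are assumptions made for
illustration, not taken from the AKA-31 driver. The loads for the next word
are issued before the previous pair is packed and stored, so the packing
never waits on a register that is still being loaded:

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: read nwords * 2 halfwords from a 16-bit-wide
 * source and pack them into 32-bit words, low halfword first.
 * The next pair of halfwords is loaded before the previous pair is
 * packed, so the shift/or work does not depend on the newest loads.
 */
void pack16_interleaved(const volatile uint16_t *src, uint32_t *dst,
                        size_t nwords)
{
    if (nwords == 0)
        return;

    /* Prime the pipeline with the first pair. */
    uint32_t lo = src[0];
    uint32_t hi = src[1];

    for (size_t i = 1; i < nwords; i++) {
        /* Issue the loads for the next word first... */
        uint32_t next_lo = src[2 * i];
        uint32_t next_hi = src[2 * i + 1];

        /* ...then pack and store the pair loaded one iteration ago. */
        dst[i - 1] = lo | (hi << 16);

        lo = next_lo;
        hi = next_hi;
    }

    /* Drain the last pair. */
    dst[nwords - 1] = lo | (hi << 16);
}
```

A C compiler is free to reschedule the packing, so this only shows the shape
of the loop; a real driver would most likely hand-schedule it in assembler.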
Dunno much about the RiscPC's hardware by heart, but is it possible to use
a MEMC DMA channel to get the data from the card? Would/can it be faster?
> So, to the question. Is there a way to map just one page of podule-space
> (the page where the buffer memory is mapped) to be cached/bufferable? I
> really think that on a strongarm this will be a sufficient win to make
> syncing the cache during such transfers a price worth paying.
Dunno the internals that well, but I guess it'll be possible... can't see
why it couldn't be done. But are the bus-space routines using ldm/stm?
Could they be optimized? Or is the gain negligible...
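
For reference, a rough sketch of what marking a single page cached/bufferable
comes down to at the page-table level. It relies on the ARM small-page
(4KB) second-level descriptor format, where bit 3 is C (cacheable) and bit 2
is B (bufferable). The macro and function names are made up for this
example and are not the kernel's pmap interface; finding the right entry,
flushing the TLB and syncing the cache afterwards are not shown:

```c
#include <stdint.h>

/* ARM second-level small-page descriptor bits (names are hypothetical). */
#define PTE_TYPE_MASK   0x3u
#define PTE_SMALL_PAGE  0x2u        /* bits [1:0] = 10: 4KB small page */
#define PTE_B           (1u << 2)   /* bufferable */
#define PTE_C           (1u << 3)   /* cacheable */

/* Return the descriptor with both C and B set (cached and buffered). */
static inline uint32_t pte_set_cached(uint32_t pte)
{
    return pte | PTE_C | PTE_B;
}

/* Return the descriptor with only B set (write-buffered, not cached). */
static inline uint32_t pte_set_buffered_only(uint32_t pte)
{
    return (pte & ~PTE_C) | PTE_B;
}

/* Non-zero if the descriptor maps a small page. */
static inline int pte_is_small_page(uint32_t pte)
{
    return (pte & PTE_TYPE_MASK) == PTE_SMALL_PAGE;
}
```

Whether the card's buffer memory behaves correctly when cached would still
have to be checked on real hardware.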
Regards,
Reinoud