port-arm32: Re: Cacheing parts of podule space

Subject: Re: Cacheing parts of podule space
To: None <richard.earnshaw@arm.com>
From: Reinoud Zandijk <zandijk@cs.utwente.nl>
List: port-arm32
Date: 09/09/1999 11:40:57

Hi Richard,

On Thu, 9 Sep 1999, Richard Earnshaw wrote:
> I finally got fed up with the sucky performance of my Acorn AKA-31 SCSI 
> card, so I've re-written the driver for it to make use of the on-board 
> buffer memory that can be used for DMA to the SCSI bus.  Now, instead of 
> the truly sucking 100K/s or thereabouts that I used to get, I now get 
> about 900K/s with much lower load on the machine.

Great! I haven't got an AKA-31 SCSI card, but it's always great to see
such a performance boost!

> However, I've hit a brick wall; the bottleneck now seems to be the time 
> taken to transfer the data to the buffer memory.  The podule-space is 
> mapped uncached (sounds reasonable, you say), but on a StrongARM this 
> means that ldm/stm transfers are not buffered or streamed, so the hardware 
> in effect breaks out each load/store in the instruction into a separate 
> bus transaction, which probably means that the throughput to the buffer 
> memory is divided by at least 2 and probably 3 (I forget the details).  
> Ouch!  Further, these cycles are all running at the podule bus speed, 
> again I forget the numbers, but that's something like 8MHz.

Yep, it is. As far as I know, there are 3 different cycle types for each
podule, which are selected by using an offset into its podule space (in
RiscOS): slow, medium and fast. Have you checked them? Are you using
EASI, i.e. 32-bit bus transfers? Does the card support that?
Another ``trick'' is to use the pipeline of the StrongARM, i.e. process
data while retrieving it: interleave loading the next 8/16-bit word with
packing the bytes already retrieved into 32 bits. You get the extra
cycles for free as long as you avoid register interlocks, so this can be
faster than doing an ldm/stm and packing afterwards.
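
To make the interleaving idea concrete, here is a rough sketch in C. It
is only an illustration under assumptions that are not from the driver:
the name copy16_packed is made up, the card is assumed to present its
data as consecutive 16-bit values with the low halfword first, and a real
podule window may well be laid out differently. The compiler is also free
to reorder things, so real code would probably be hand-written assembly.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from any real driver): read 2 * nwords
 * 16-bit values from a device window and store them as nwords packed
 * 32-bit words, low halfword first.
 *
 * The load of the next halfword is issued before the previous pair is
 * packed and stored, so the packing work can overlap the slow bus read
 * instead of waiting for it.
 */
static void
copy16_packed(volatile const uint16_t *src, uint32_t *dst, size_t nwords)
{
	uint16_t lo, hi, next_lo, next_hi;
	size_t i;

	if (nwords == 0)
		return;

	/* Prime the loop with the first pair. */
	lo = src[0];
	hi = src[1];

	for (i = 0; i + 1 < nwords; i++) {
		/* Start fetching the next pair ... */
		next_lo = src[2 * i + 2];
		/* ... and pack the pair we already have in the meantime. */
		dst[i] = (uint32_t)lo | ((uint32_t)hi << 16);
		next_hi = src[2 * i + 3];

		lo = next_lo;
		hi = next_hi;
	}

	/* Pack the final pair; nothing is left to prefetch. */
	dst[nwords - 1] = (uint32_t)lo | ((uint32_t)hi << 16);
}
```

The point is only the ordering: no value is used in the statement right
after the one that loads it, which is what avoids the interlock.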

Dunno much about the RiscPC's hardware by heart, but is it possible to use
a MEMC DMA channel to get the data from the card? Would/could it be faster?

> So, to the question.  Is there a way to map just one page of podule-space 
> (the page where the buffer memory is mapped) to be cached/bufferable?  I 
> really think that on a strongarm this will be a sufficient win to make 
> syncing the cache during such transfers a price worth paying.

Dunno the internals that well, but I guess it'll be possible... I can't
see why it couldn't be done. But are the bus-space routines using ldm/stm?
Could they be optimized? Or is the gain negligible?

Regards,

Reinoud