Subject: Re: ARM cache
To: Toru Nishimura <locore64@alkyltechnology.com>
From: Jesse Off <joff@embeddedARM.com>
List: port-arm
Date: 01/16/2005 20:45:31
>> IIRC, context switches are very expensive on arm due to the
>> virtually-indexed L1 cache flushing.  Is the pagedaemon kernel thread
>> the one needing to run this often and therefore competing with ttcp?
>
> pagedaemon is one of "all-kernel-space" runtime context and makes no
> address space switching.  It's true that virtual-indexed nature of ARM
> processor is very ill-fit with Unix style process model OS anyway.

True, but the involuntary context switch counter is somehow being
incremented nonetheless.  Perhaps it's time for me to dig into the
scheduler code.  Maybe the counter is incremented every 100ms whether or
not a context switch actually takes place?

>
>> * bypass most dmamap_sync() and use DMA_COHERENT mappings
>
> I'm afraid ARM cache is not designed for bus snooping.  Given relatively
> high associativity with small capacity in size the possibility of cache
> inconsistency is statistically small for real load, however it would
> happen.

I understand that.  What I was doing before was letting the L1 cache cache
my buffers, then using bus_dmamap_sync() to write back, write back and
invalidate, or invalidate individual buffers/address ranges before/after
DMA took place.  My suspicion was that with uncacheable mappings, the
SDRAM controller on the ep9302 would not issue SDRAM bursts
asynchronously, since it would have no cache line to fill or write back.
Without SDRAM bursts to amortize the setup bus cycles, each memory access
would incur the additional penalties of RAS, CAS, and potentially
PRECHARGE bus cycles.  If all this happens synchronously and stalls the
processor, one stands to take a ~8 tick penalty @SDRAM speed on each
load/store to these buffers.  In my initial implementation I thought it
worthwhile to avoid that.
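
To put rough numbers on that reasoning, here is a back-of-envelope sketch
in C.  Everything in it is an assumption for illustration: the 8-tick
setup cost is just the "~8 tick" guess above, and the 1-tick-per-word
transfer cost, the 8-word (32-byte) cache line, and the function names are
made up, not measured ep9302 figures.  It also leaves out the cost of the
sync calls themselves.

```c
#include <assert.h>

/* Illustrative assumptions only, not measured ep9302 values. */
enum {
    SETUP_TICKS = 8,  /* the "~8 tick" RAS/CAS/PRECHARGE guess from above */
    BEAT_TICKS  = 1,  /* assumed cost of transferring one 32-bit word */
    LINE_WORDS  = 8   /* assumed 32-byte cache line, i.e. 8 words */
};

/* Uncached mapping: every word access pays the full setup penalty. */
unsigned uncached_ticks(unsigned words)
{
    assert(words > 0);
    return words * (SETUP_TICKS + BEAT_TICKS);
}

/* Cached mapping: one setup penalty per cache-line burst. */
unsigned burst_ticks(unsigned words)
{
    assert(words > 0);
    /* Round up to whole cache lines. */
    unsigned lines = (words + LINE_WORDS - 1) / LINE_WORDS;
    return lines * (SETUP_TICKS + LINE_WORDS * BEAT_TICKS);
}
```

Under these assumptions a 1500-byte frame (375 words) costs
uncached_ticks(375) = 3375 ticks touched word by word, versus
burst_ticks(375) = 752 ticks when moved as line bursts.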

What I found was that even if this is the case, profiling showed a slight
advantage to using DMA_COHERENT mappings, since the overhead of the
bus_dma calls whittled away any advantage gained from SDRAM bursts.
Another reason I should have profiled first. :-/

//Jesse Off