Subject: Re: copyin/out
To: <>
From: David Laight <david@l8s.co.uk>
List: port-arm
Date: 08/09/2002 23:10:25
> > I got a significant benefit on SA1100 by doing a read ahead of the target
> > address (to pull it into the data cache).  I can't remember the speeds
> > I got (and no longer have a test system).  But this is the byte loop:
> > 	ldrb    r4, [r0], #1
> > 11:	subs    r2, r2, #1
> > 	ldrneb  r5, [r1,#24]
> > 	strb    r4, [r1], #1
> > 	ldrneb  r4, [r0], #1
> > 	bne     11b
> > Only with the 'prefetch' did the order of the instructions matter.
> 
> That makes sense since your stalls were otherwise lost in the noise of
> a cache miss-write-through cycle.  The 'prefetch' pulled the target
> into the cache so you didn't have a cache miss and so the store went
> to the cache instead of to memory.

Yes - the interesting thing is that the timings seemed to imply
that the code wasn't waiting for the prefetch.  I was copying
large buffers (many multiples of the cache size) from a 4k
boundary to 4k + 1.  The prefetch at offset 24 was slightly
better than smaller or larger values.
(either that or because of the way all the memory transfers
then get interleaved).
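
A rough C model of the byte loop quoted above, as a sketch only.  The
name copy_bytes_read_ahead and the READ_AHEAD constant are made up for
illustration; the value 24 is the offset mentioned above.  Unlike the
assembler, this version never reads past the end of the destination
buffer, and a compiler is free to schedule it quite differently.

```c
#include <stddef.h>

#define READ_AHEAD 24   /* offset that timed best in the test described above */

/* Byte copy that touches the destination READ_AHEAD bytes ahead of the
 * store, so the destination line is already cached when it is written. */
static void copy_bytes_read_ahead(unsigned char *dst,
                                  const unsigned char *src, size_t len)
{
    volatile unsigned char sink = 0;    /* keeps the read-ahead load alive */

    for (size_t i = 0; i < len; i++) {
        if (i + READ_AHEAD < len)
            sink = dst[i + READ_AHEAD]; /* models ldrneb r5, [r1, #24] */
        dst[i] = src[i];                /* models the ldrb/strb pair */
    }
    (void)sink;
}
```

The read-ahead value is discarded; only the side effect of filling the
cache line matters.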

> > How about:
> > 	ands	r6, r1, #3
> > 	addne	pc, pc, r6, lsl #3
> > 	b	Lialend
> > 	nop
> > 	nop
> > 	ldrbt	r7, [r0], #1
> > 	strb	r7, [r1], #1
> > 	ldrbt	r7, [r0], #1
> > 	strb	r7, [r1], #1
> > 	ldrbt	r7, [r0], #1
> > 	strb	r7, [r1], #1
> > 	rsb	r6, r6, #4
> > 	sub	r2, r2, r6
> > Lialend:
> 
> I'm not sure I've convinced myself that that's the same thing.  Also,
> you'll have data dep stalls just using r7 there.  Have you tested this?

No! I think it is logically correct (if not, something very similar
will work).  If using the same register for adjacent transfers
generates a stall, there are plenty of others!
Is there a result delay on the update of r0 and r1?
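
The intent of the computed jump can be modelled in C with a switch that
falls through, as a sketch only.  The name align_dest is made up for
illustration, and the caller is assumed to guarantee at least 3 bytes
remain (an assumption, not something stated in the thread).  The point
is that a destination misaligned by n needs 4 - n byte copies, and the
same amount comes off the count.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy single bytes until *dstp is 4-byte aligned and return the
 * remaining length.  Assumes len >= 3 on entry. */
static size_t align_dest(unsigned char **dstp, const unsigned char **srcp,
                         size_t len)
{
    unsigned char *d = *dstp;
    const unsigned char *s = *srcp;
    unsigned mis = (unsigned)((uintptr_t)d & 3u);   /* ands r6, r1, #3 */

    switch (mis) {              /* stands in for the add to pc */
    case 1: *d++ = *s++;        /* fall through */
    case 2: *d++ = *s++;        /* fall through */
    case 3: *d++ = *s++;
        len -= 4u - mis;        /* bytes actually copied */
        break;
    default:                    /* already aligned: nothing to do */
        break;
    }
    *dstp = d;
    *srcp = s;
    return len;
}
```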
> 
> > > 	/* If few bytes left, finish slow. */
> > > 	cmp	r2, #0x08
> > > 	blt	Licleanup
> > 
> > Surely it is worth increasing the number of bytes we enter this
> > path with to ensure that check never takes.
> 
> Possibly.  Although I rather suspect that we often call this code
> with 4 or 8 bytes (granted, probably aligned).  This gets back to the
> histograms...  :-)

Actually you could move the Lialend label to after this test.
(Oh why not use numeric labels for short jumps - saves clutter)

If those values are common, it may be worth doing them without
saving any registers....
> 
> > > 	/* If source is not aligned, finish slow. */
> > > 	ands	r3, r0, #0x03
> > > 	bne	Licleanup
> > Maybe worth checking src and dest have same alignment earlier
> 
> I did this at first, but this increased the number of branches for
> some paths in the code.

I wonder if there are ever significant transfers that actually
align themselves?  Maybe those histograms...
> 
> > For SA1100, if you can do 'stm' writes of 4 words then you don't
> > need to worry about the destination being cached (unless you want
> > the data soon).
> 
> That's interesting.  Can you elaborate some on this?  Or point me to
> a specific location in the manual?

I did some timing for aligned copies: if they are long enough,
it doesn't make any difference if you pull the destination
into the cache.  The reason is that a 4 word stm generates the
same transfer on the memory bus as the cache line write.
A hardware engineer at the company I used to work for used a
logic analyser to look at the DRAM timings for the SA1100;
he said he only saw burst writes for cache writes and (IIRC) stm
(the book implies you get them for other writes...).
I didn't see those traces myself.

Of course, if you are going to read the destination into the
cache (typical for a short copyin), you are better off doing
it during the copy....
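
The shape of a 4-word-at-a-time aligned copy, as a C sketch only.  The
name copy_words_by_four is made up for illustration, both pointers are
assumed to be word aligned, and whether a compiler turns the four loads
and four stores into ldm/stm is not guaranteed.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy nwords 32-bit words, four at a time where possible.
 * Both pointers are assumed to be word aligned. */
static void copy_words_by_four(uint32_t *dst, const uint32_t *src,
                               size_t nwords)
{
    while (nwords >= 4) {
        /* Four loads, then four stores: the pattern that ldm/stm cover. */
        uint32_t a = src[0], b = src[1], c = src[2], d = src[3];
        dst[0] = a;
        dst[1] = b;
        dst[2] = c;
        dst[3] = d;
        src += 4;
        dst += 4;
        nwords -= 4;
    }
    while (nwords > 0) {        /* leftover 1 to 3 words */
        *dst++ = *src++;
        nwords--;
    }
}
```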
> 
> > Also aligning the source might be a win on
> > because you could use ldm after an initial ldrt (saving cache).
> 
> Cache & code size.

There are two savings, fetching the instruction words and
not having to refetch the words you displace from the cache.

> > Also, like the byte align code, it ought to be possible to
> > avoid the data read and the 'sub r2,r2,#4' in each case.
> 
> The sub is really cheap because you'd be stuck in a data stall there
> anyway, I believe.

Except that you have to fetch it from main memory, unless you are
doing performance measurements, when it is likely to still be
resident.  I considered adding an I-cache invalidate to my
timing test loop.
> 
> > I'd try to reduce the code size by only having the 'copy
> > cacheline' present once.  Shouldn't be too hard!
> 
> This is the classic size/speed tradeoff, I believe.  Either
> we have another branch or we may prefetch data we will never
> need.

Can't you branch back to after the prefetch on the last iteration?
> 
> > A truly horrid loop!
> 
> Thanks, I think.  ;-)

I suspect there may be more misaligned transfers than you've
bargained for.  In which case you may want to avoid the data
stalls - or put the loop control code into them.

	David

-- 
David Laight: david@l8s.co.uk