Subject: Re: ARM bswap optimizations
To: Jason R Thorpe <thorpej@wasabisystems.com>
From: Richard Earnshaw <rearnsha@buzzard.freeserve.co.uk>
List: port-arm
Date: 08/14/2002 00:06:01
> Hi folks...
> 
> I'm wanting to shave some cycles out of the TCP/IP code on ARM.  hton*()
> and ntoh*() are low-hanging fruit.  The issues:
> 
> 	* Constants are not byte-swapped at compile-time.
> 
> 	* A function must be called to do the byte-swap.  This costs
> 	  3 cycles to call the function (one to branch, 2 for the
> 	  pipeline flush), and 3 cycles to return.  This is significant
> 	  overhead if you consider that it's 4 insns to byte-swap an int,
> 	  and 3 insns to byte-swap a short.
> 
> The following patch addresses these issues.  I'd appreciate it if people
> would read it over to make sure that I didn't screw up the asm (mostly
> the constraints :-)  I've booted it multi-user on an XScale.

Writing your inline as

inline u_int32_t
__byte_swap_long_var(u_int32_t v)
{
	u_int32_t t1, t2, t3;

	t1 = v ^ ((v << 16) | (v >> 16));	/* v ^ ROR(v, 16) */
	t2 = t1 & 0xff00ffff;			/* clear bits 16-23 */
	t3 = (v >> 8) | (v << 24);		/* ROR(v, 8) */
	return t3 ^ (t2 >> 8);
}

enables gcc to generate a sequence that is only one instruction longer (5 
rather than 4 instructions -- and a pattern to eliminate the fifth could 
be fairly easily added to gcc).  It has the added advantage that the 
compiler will do any constant reduction for you.  E.g.:

u_int32_t foo()
{
	return (__byte_swap_long_var(0x01234567));
}

compiles as:

foo:
        mov     ip, sp
        stmfd   sp!, {fp, ip, lr, pc}
        ldr     r0, .L5
        sub     fp, ip, #4
        ldmea   fp, {fp, sp, pc}
.L6:
        .align  0
.L5:
        .word   1732584193	@ = 0x67452301

The main advantage of leaving it as C code is that the compiler can 
schedule the instructions individually.
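
For anyone who wants to check the arithmetic by hand, here is a rough 
sketch that traces 0x01234567 through the four steps.  The names 
bswap32_sketch and bswap32_sketch_check, and the use of uint32_t from 
<stdint.h> rather than u_int32_t, are my assumptions to keep the example 
standalone; they are not taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Standalone copy of the sequence above.  The name bswap32_sketch is
 * hypothetical, chosen only for this illustration.
 */
static inline uint32_t
bswap32_sketch(uint32_t v)
{
	uint32_t t1, t2, t3;

	t1 = v ^ ((v << 16) | (v >> 16));	/* 0x01234567 -> 0x44444444 */
	t2 = t1 & 0xff00ffff;			/*            -> 0x44004444 */
	t3 = (v >> 8) | (v << 24);		/*            -> 0x67012345 */
	return t3 ^ (t2 >> 8);			/*            -> 0x67452301 */
}

/* Hypothetical self-check: swapping twice must give back the input. */
static inline void
bswap32_sketch_check(void)
{
	assert(bswap32_sketch(0x01234567) == 0x67452301);
	assert(bswap32_sketch(bswap32_sketch(0xdeadbeef)) == 0xdeadbeef);
}
```

The final value matches the .word in the listing above.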

Similar, simpler code can be written for half-words.
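
A minimal sketch of what the half-word version might look like, assuming 
a plain shift-and-or is acceptable.  The names bswap16_sketch and 
bswap16_sketch_check and the <stdint.h> types are hypothetical; I have 
not checked what gcc emits for this on ARM.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical name: swap the two bytes of a 16-bit value. */
static inline uint16_t
bswap16_sketch(uint16_t v)
{
	/*
	 * Integer promotion widens v to int before the shifts, so the
	 * cast back to uint16_t discards what was shifted above bit 15.
	 */
	return (uint16_t)((v << 8) | (v >> 8));
}

/* Hypothetical self-check with a known value and a round trip. */
static inline void
bswap16_sketch_check(void)
{
	assert(bswap16_sketch(0x1234) == 0x3412);
	assert(bswap16_sketch(bswap16_sketch(0xbeef)) == 0xbeef);
}
```

As with the 32-bit case, keeping this in C should let the compiler 
reduce a constant argument at compile time.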

R.