Subject: Re: bswap{16,32,64} in libutil ?
To: Manuel Bouyer <bouyer@antioche.lip6.fr>
From: Eduardo E. Horvath <eeh@one-o.com>
List: tech-userlevel
Date: 03/05/1998 08:41:04
On Wed, 4 Mar 1998, Manuel Bouyer wrote:

> On Mar 4, Eduardo E. Horvath wrote
> > Is it possible to break these functions down into "store as little-endian"
> > and "store as big endian"?  There are architectures where this is a big
> > performance issue.  Byte swapping on SPARC V9 machines is a complicated
> > process involving lots of shifts and masking, but storing to a particular
> > endianness is practically a NOP.  Machines that really do swaps can
> > have one set of macros as NOPs and the others do the swap.  (Or do I just
> > assume that the only time bswap*() is called on a big endian machine is to
> > access little-endian data and vice-versa?)
> > 
> 
> If they also can do 'read as little-endian' and 'read as big endian', it will
> be possible. If only store instructions are available, it will be harder,
> as there are places where conversions are done 'on the fly', such as
>  var = ufs_bswap32(ufs_bswap32(var) +1); /* formerly was var++ */

It does reads as well.  For those of you who really want to know the nitty
gritty details:

There is a bit in the PSTATE register that tells the machine to run in
little-endian mode.  There is another bit that tells the machine to take
traps in little-endian mode.  The MMU's TTE has a bit to indicate that a
particular page has its endianness inverted.  There are also ASIs that
specify whether to access a particular location as little endian or big
endian.  This makes the generation of a bswap macro a bit difficult.

I suppose the most efficient way of doing this would be to put a #pragma
on the datatype to indicate the endianness and let the compiler deal with
it.  Otherwise the above code would require something on the order of 2
loads and 2 stores, eg:

	ldwa 	[var] ASI_PRIMARY_LITTLE, %o0	; ufs_bswap32(var)
	inc	%o0				; var = var + 1
	stw	%o0, [tmp]			; tmp = var
	ldwa	[tmp] ASI_PRIMARY_LITTLE, %o0	; ufs_bswap32(tmp)
	stw	%o0, [var]			; var = tmp

An optimized version should be:

	ldwa	[var] ASI_PRIMARY_LITTLE, %o0	; ufs_bswap32(var)
	inc	%o0				; var = var + 1
	stwa	%o0, [var] ASI_PRIMARY_LITTLE	; var = ufs_bswap32(var)

Since there are already gcc macros for lda and sta (and ldha (16-bit),
stha, ldxa (64-bit), and stxa) the macros could be designed something like
this:

#define load32_little(addr)	lda((addr),ASI_PRIMARY_LITTLE)
#define load32_big(addr)	lda((addr),ASI_PRIMARY)
#define load16_little(addr)	ldha((addr),ASI_PRIMARY_LITTLE)
#define load16_big(addr)	ldha((addr),ASI_PRIMARY)
#define load64_little(addr)	ldxa((addr),ASI_PRIMARY_LITTLE)
#define load64_big(addr)	ldxa((addr),ASI_PRIMARY)
#define store32_little(addr,v)	sta((addr),(v),ASI_PRIMARY_LITTLE)
#define store32_big(addr,v)	sta((addr),(v),ASI_PRIMARY)
#define store16_little(addr,v)	stha((addr),(v),ASI_PRIMARY_LITTLE)
#define store16_big(addr,v)	stha((addr),(v),ASI_PRIMARY)
#define store64_little(addr,v)	stxa((addr),(v),ASI_PRIMARY_LITTLE)
#define store64_big(addr,v)	stxa((addr),(v),ASI_PRIMARY)
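
For machines without these ASIs, the same interface could be sketched in
portable C.  This is only an illustration: the `_portable' names, the
pointer-based signatures, and the inc32_little() helper are assumptions of
mine, not existing macros.  The byte-wise accesses make the result
independent of the host's endianness, at the cost of the shifts a real
ASI load/store would avoid.

```c
#include <stdint.h>

/* Hypothetical portable fallbacks; the names are illustrative only. */
static inline uint32_t load32_little_portable(const void *addr)
{
	const uint8_t *p = (const uint8_t *)addr;

	/* Byte 0 is the least significant byte. */
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	    ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

static inline uint32_t load32_big_portable(const void *addr)
{
	const uint8_t *p = (const uint8_t *)addr;

	/* Byte 0 is the most significant byte. */
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	    ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

static inline void store32_little_portable(void *addr, uint32_t v)
{
	uint8_t *p = (uint8_t *)addr;

	p[0] = (uint8_t)v;
	p[1] = (uint8_t)(v >> 8);
	p[2] = (uint8_t)(v >> 16);
	p[3] = (uint8_t)(v >> 24);
}

static inline void store32_big_portable(void *addr, uint32_t v)
{
	uint8_t *p = (uint8_t *)addr;

	p[0] = (uint8_t)(v >> 24);
	p[1] = (uint8_t)(v >> 16);
	p[2] = (uint8_t)(v >> 8);
	p[3] = (uint8_t)v;
}

/* Increment a little-endian 32-bit field in place: one load, one store. */
static inline void inc32_little(void *addr)
{
	store32_little_portable(addr, load32_little_portable(addr) + 1);
}
```

With helpers of this shape, the "on the fly" increment becomes a single
load/store pair on any host, which is the same shape as the optimized
sequence above.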

Unfortunately, mapping this to an in-register bswap is non-trivial.  I'll
need to take another close look at the V9 spec and see if there's some
efficient way of doing this inside a register.
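
For comparison, the generic in-register swap that has to be used when no
endian-aware load or store is involved looks roughly like the following.
This is a sketch of the usual shift-and-mask technique, not the actual
bswap code under discussion; the shift_swap*() and inc_swapped32() names
are placeholders made up for the example.

```c
#include <stdint.h>

/* Swap the two bytes of a 16-bit value. */
static inline uint16_t shift_swap16(uint16_t x)
{
	return (uint16_t)((x << 8) | (x >> 8));
}

/* Swap the four bytes of a 32-bit value with shifts and masks. */
static inline uint32_t shift_swap32(uint32_t x)
{
	return ((x & 0x000000ffu) << 24) | ((x & 0x0000ff00u) << 8) |
	    ((x & 0x00ff0000u) >> 8) | ((x & 0xff000000u) >> 24);
}

/* Swap a 64-bit value by swapping each half and exchanging the halves. */
static inline uint64_t shift_swap64(uint64_t x)
{
	return ((uint64_t)shift_swap32((uint32_t)x) << 32) |
	    (uint64_t)shift_swap32((uint32_t)(x >> 32));
}

/* The "on the fly" increment from the quoted code, done in a register. */
static inline uint32_t inc_swapped32(uint32_t var)
{
	return shift_swap32(shift_swap32(var) + 1);
}
```

Each 32-bit swap here costs four masks, four shifts, and three ORs before
any optimization, which is the overhead the endian-specific load and store
macros are meant to avoid.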

=========================================================================
Eduardo Horvath				eeh@one-o.com
"Cliffs are for climbing.  That's why God invented grappling hooks."
					- Benton Frasier