Subject: Fast ffs (was Re: Request for testers.)
To: None <cg110@york.ac.uk>
From: None <olly@muscat.co.uk>
List: port-arm32
Date: 04/22/1997 09:29:32
There's an obvious tweak -- use ANDS and NE to avoid the LDRB for ffs(0).
Here's a rewritten version with APCS register names:

ffs
 ; Standard trick to isolate bottom bit in a0 or 0 if a0 = 0 on entry
 rsb     a1, a0, #0
 ands    a0, a0, a1

 ; now a0 has at most one set bit, call this X
 ; if X = 0, all further instructions are skipped
 orrne   a0, a0, a0, lsl #4  ; a0 = X * 0x11
 orrne   a0, a0, a0, lsl #6  ; a0 = X * 0x451
 rsbne   a0, a0, a0, lsl #16 ; a0 = X * 0x0450fbaf

 ; now lookup in table indexed on top 6 bits of a0
 adrne   a2, ffs_table 
 ldrneb  a0, [ a2, a0, lsr #26 ] 
 mov     pc, lr

ffs_table
 <64 byte table>

Now we're only using 32 entries in the table, which makes me wonder if we
can use a smaller table (mind you, this isn't going to make much difference
to the speed, perhaps a small improvement from better caching).

We can pack 0 ... 31 into the 32 bit value 0x07c564dd (there are probably
others possibilities -- I found this one easily enough).

In binary 00000111110001010110010011011101 then 0000 from shifting, and we
get each of 0 ... 31 exactly once by considering sets of 5 consecutive bits.

However, I can't see how to multiply by this value in <= 3 instructions.
Looking at factors of the form 2^n +/- 1 we have 1025, 63, 15, 9, 7, 5, 3
(not all at once), but there's also a prime factor of 673 (1010100001 in
binary).

Cheers,
Olly
-- 
Usenet is not for answering questions but for questioning answers.