SIMD-like bit-twiddling on odd-sized quantities 
Author Message
 SIMD-like bit-twiddling on odd-sized quantities

I'm trying to perform a whole load of comparisons (equal/not-equal,
magnitude not important) in parallel, and here's the pseudocode I have
at the moment:

For example, when dealing with 9-bit quantities, 6 at a time
e.g. if I'm looking for the values 2,3,5,7,11,13

Firstly I create masks
maskc = 0000100000000010000000001000000000100000000010000000001000000000
mask1 = 0000000000000100000000010000000001000000000100000000010000000001
which are the locations of the '1' bit, and of the 'carry' bits.
Then for the values
                  2         3         5         7        11        13
I form the 64-bit word
vals =  0000000000001000000000110000000101000000011100000010110000001101

OK, that's just preprocessing, placing the important values 10 bits
apart, and doesn't need to be fast, it's only done once (for about
30000 values).

The rest happens a zillion and two times:

If the value I wish to test for inclusion is x, I'll form
 (((x*mask1) XOR vals) - mask1) AND maskc
and test against 0.

This expression starts by placing x in the 6 locations.
The xor leaves 0 or non-0 in each of the 6 locations.
The subtract leaves a -ve value (carry) or a non-negative value (no
carry) in the 6 locations
The and checks each of the carry flags.

If one of the values matches x, then one of the carry flags will be
set.

OK, that's what I've got currently, and it looks fine. _However_ I
could get a 14% speedup instantly if I could squeeze _7_ 9-bit values
into the 64-bit word rather than 6. I'm also fighting cache, so making
things smaller would be an added benefit.

Can anyone see a scheme that doesn't require the extra bit overhead
for the carry?

I'll not be using just 9-bit values, but will have other places where
I have 10, 11, 12, 13, 14, 15, and maybe 16-bit values.

As my normal news server appears dead currently,
<<<
bash-2.05b$ ping news.cis.dfn.de
PING news.cis.dfn.de (130.133.1.4): 56 data bytes
--- news.cis.dfn.de ping statistics ---
1742 packets transmitted, 0 packets received, 100% packet loss
I'd _really_ appreciate it if a CC: could be made by e-mail, as I find
google's lag is painful. Host name 'asdf.org' and username 'fatphil'
work, what google have as my address doesn't, thanks to altavista
being $%^&'s.

Cheers,
Phil



Tue, 15 Feb 2005 20:56:03 GMT  
 SIMD-like bit-twiddling on odd-sized quantities

Quote:

> I'm trying to perform a whole load of comparisons (equal/not-equal,
> magnitude not important) in parallel, and here's the pseudocode I have
> at the moment:

> For example, when dealing with 9-bit quantities, 6 at a time
> e.g. if I'm looking for the values 2,3,5,7,11,13

But for 9-bit values you only need a table of 512 bit (64 byte)
for a look up table. Then it takes only one processor instruction
for the test.

Quote:
> I'll not be using just 9-bit values, but will have other places where
> I have 10, 11, 12, 13, 14, 15, and maybe 16-bit values.

Even for a 16 bit value you can do it with a table (8 kbyte).

The following program will read a 9 bit number from the keyboard
and exit only if the number is 2,3,5,7,11,13,384 or 511.


00000100: e8 0048             _10:    bsr.w   get_number ; read 9-bit number into r1
00000103: 0f a3 16 010b               btst.w  r1,table   ; table look up
00000108: 73 f6                       bcc.b   _10        ; loop if number not found
0000010a: c3                          rts.w

                                      ; set: 2,3,5,7,11,13,384,511
0000010b: 000028ac            table:  dc.l    %00000000000000000010100010101100 ; 0031-0000
0000010f: 00000000                    dc.l    %00000000000000000000000000000000 ; 0063-0032
00000113: 00000000                    dc.l    %00000000000000000000000000000000 ; 0095-0064
00000117: 00000000                    dc.l    %00000000000000000000000000000000 ; 0127-0096
0000011b: 00000000                    dc.l    %00000000000000000000000000000000 ; 0159-0128
0000011f: 00000000                    dc.l    %00000000000000000000000000000000 ; 0191-0160
00000123: 00000000                    dc.l    %00000000000000000000000000000000 ; 0223-0192
00000127: 00000000                    dc.l    %00000000000000000000000000000000 ; 0255-0224
0000012b: 00000000                    dc.l    %00000000000000000000000000000000 ; 0287-0256
0000012f: 00000000                    dc.l    %00000000000000000000000000000000 ; 0319-0288
00000133: 00000000                    dc.l    %00000000000000000000000000000000 ; 0351-0320
00000137: 00000000                    dc.l    %00000000000000000000000000000000 ; 0383-0352
0000013b: 00000001                    dc.l    %00000000000000000000000000000001 ; 0415-0384
0000013f: 00000000                    dc.l    %00000000000000000000000000000000 ; 0447-0416
00000143: 00000000                    dc.l    %00000000000000000000000000000000 ; 0479-0448
00000147: 80000000                    dc.l    %10000000000000000000000000000000 ; 0511-0480

                              get_number:
0000014b: b8 0e0d             _30:    move.w  #$0e0d,r0
0000014e: cd 10                       trap    #$10
00000150: b8 0e0a                     move.w  #$0e0a,r0
00000153: cd 10                       trap    #$10
00000155: 66 31 d2                    eor.l   r1,r1
00000158: 31 c0               _20:    eor.w   r0,r0
0000015a: cd 16                       trap    #$16
0000015c: 3c 0d                       cmp.b   #13,r0
0000015e: 74 22                       beq.b   _10
00000160: 3c 30                       cmp.b   #'0',r0
00000162: 72 f4                       blo.b   _20
00000164: 3c 39                       cmp.b   #'9',r0
00000166: 77 f0                       bhi.b   _20
00000168: 66 0f b6 c8                 movu.bl r0,r2
0000016c: 80 e9 30                    sub.b   #'0',r2

0000016f: b4 0e                       move.b  #$0e,m0
00000171: cd 10                       trap    #$10

00000173: 66 d1 e2                    lsl.l   #1,r1
00000176: 66 01 d1                    add.l   r1,r2
00000179: 66 c1 e2 02                 lsl.l   #2,r1
0000017d: 66 01 ca                    add.l   r2,r1
00000180: eb d6                       br.b    _20
00000182: 66 81 fa 000001ff   _10:    cmp.l   #%111111111,r1
00000189: 77 c0                       bhi.b   _30
0000018b: c3                          rts.w

The binary:







test.com
del test.com



Wed, 16 Feb 2005 01:45:58 GMT  
 SIMD-like bit-twiddling on odd-sized quantities

Quote:


>> I'm trying to perform a whole load of comparisons (equal/not-equal,
>> magnitude not important) in parallel, and here's the pseudocode I have
>> at the moment:

>> For example, when dealing with 9-bit quantities, 6 at a time
>> e.g. if I'm looking for the values 2,3,5,7,11,13

> But for 9-bit values you only need a table of 512 bit (64 byte)
> for a look up table. Then it takes only one processor instruction
> for the test.

+180 or more wasted instruction slots to get the datum into the cache.

The reason I'm using the technique I am is because these tables have now
become too large for sensible use. The cross-over does appear to be
somewhere below the 1024-bit level, but depends on several factors, so I
want to implement a 9-bit version in as well, just in case.

Quote:
>> I'll not be using just 9-bit values, but will have other places where
>> I have 10, 11, 12, 13, 14, 15, and maybe 16-bit values.

> Even for a 16 bit value you can do it with a table (8 kbyte).

Which totally totally kills my cache.
I have about 2000 of these arrays, and I nead access to most of them most
of the time. With 4-6 values crammed into a single 64-bit word, and on
average about 10 values per table, the bit-packing method requires 32KB,
and 2 bit-twiddling comparisons, rather than several megabytes and
'one processor instruction'.

Quote:
> The following program will read a 9 bit number from the keyboard
> and exit only if the number is 2,3,5,7,11,13,384 or 511.

Thanks, but it is the answer to a different question.

Terje Matheson's .sig is relevant here...

Phil



Wed, 16 Feb 2005 07:40:32 GMT  
 SIMD-like bit-twiddling on odd-sized quantities

Quote:

> >> I'll not be using just 9-bit values, but will have other places where
> >> I have 10, 11, 12, 13, 14, 15, and maybe 16-bit values.

> > Even for a 16 bit value you can do it with a table (8 kbyte).

> Which totally totally kills my cache.

If this is really a problem, set the memory area for the tables non
cacheable.

If you don't want to use the complete table, why not use at least a
hash function. The simplest way is to use two bit-packed arrays, on for
the even and one for the odd values. This way you normally have only to
search half of the number of values (even or odd table) and each stored
value is one bit shorter.

But back to your original question. The following should work (untested):

vi: stored values
s : search value

      +--------+--------+--------+---/.../-----+--------+
A=    |   v1   |   v2   |   v3   |             |   vn   |
      +--------+--------+--------+---/.../-----+--------+

      +--------+--------+--------+---/.../-----+--------+
B=    |   s    |   s    |   s    |             |   s    |
      +--------+--------+--------+---/.../-----+--------+

      +--------+--------+--------+---/.../-----+--------+
C=    | 0...01 | 0...01 | 0...01 |             | 0...01 |
      +--------+--------+--------+---/.../-----+--------+

X = not (A xor B)
Y = X add C
if CARRY -> found  (test only necessary if bit length = 2,4,8,16,...
Z = (X and C) xor (Y and C) xor C
if Z not equal 0 -> found

But if you really want high speed, use a table lookup!



Thu, 17 Feb 2005 01:14:55 GMT  
 SIMD-like bit-twiddling on odd-sized quantities

Quote:


>> >> I'll not be using just 9-bit values, but will have other places where
>> >> I have 10, 11, 12, 13, 14, 15, and maybe 16-bit values.

>> > Even for a 16 bit value you can do it with a table (8 kbyte).

>> Which totally totally kills my cache.

> If this is really a problem, set the memory area for the tables non
> cacheable.

But these are the things I want in my cache. All of them. My problem is
that when I have 3 MB of them, they tend to start pushing each other out
of the cache.

Quote:
> If you don't want to use the complete table, why not use at least a
> hash function. The simplest way is to use two bit-packed arrays, on for
> the even and one for the odd values. This way you normally have only to
> search half of the number of values (even or odd table) and each stored
> value is one bit shorter.

That's certainly an idea - it's applicable to the array lookup setup too.
I'll bear it in mind for that context later on.

Quote:
> But back to your original question. The following should work (untested):

> vi: stored values
> s : search value

>       +--------+--------+--------+---/.../-----+--------+
> A=    |   v1   |   v2   |   v3   |             |   vn   |
>       +--------+--------+--------+---/.../-----+--------+

>       +--------+--------+--------+---/.../-----+--------+
> B=    |   s    |   s    |   s    |             |   s    |
>       +--------+--------+--------+---/.../-----+--------+

>       +--------+--------+--------+---/.../-----+--------+
> C=    | 0...01 | 0...01 | 0...01 |             | 0...01 |
>       +--------+--------+--------+---/.../-----+--------+

> X = not (A xor B)
> Y = X add C
> if CARRY -> found  (test only necessary if bit length = 2,4,8,16,...
> Z = (X and C) xor (Y and C) xor C
> if Z not equal 0 -> found

OK, that seems to let the carry flag be part of the next field above.
Therefore one can't rely on its boolean value to see if a carry occured,
but instead look for a change in its value, which will be concealed by
the mere act of adding 1, such that the carry flag staying the same is
the indicating feature.

e.g. 1) 3 bit fields in an 8-bit word. v1=2, v2=0. s=0.
  A = 00 010 000
  B = 00 000 000
  X = 11 101 111
  C = 00 001 001
  Y = 11 111 000
X&C = 00 001 001
Y&C = 00 001 000
  Z = 00 001 000 -> match :-)

e.g. 2) 3 bit fields in an 8-bit word. v1=0, v2=2. s=0.
  A = 00 000 010
  B = 00 000 000
  X = 11 111 101
  C = 00 001 001
  Y = 01 000 110
X&C = 00 001 001
Y&C = 00 000 000
  Z = 00 000 000 -> no match detected :-(

SO there's a slight problem, but it's fixable. The carry detect stage at
the top bit should be performed for all packing methods, not just the
powers of two (where the carry takes place in an external carry bit
(which incidentally I don't have on my processor, but I'll not be needing
16-bit values for a while)).

However, I think that if I dope the 'unused' part of A with non-zeroes, I can
simply pretend there's an extra partial field (which in B will be zeroes).

e.g. 3) 3 bit fields in an 8-bit word. v1=0, v2=2. s=0.
  A = 11 000 010
  B = 00 000 000
  X = 00 111 101
  C = 01 001 001 (extra carry bit)
  Y = 10 000 110
X&C = 00 001 001
Y&C = 00 000 000
  Z = 01 000 000 -> match detected :-)

and checking to with the previous example.
e.g. 4) 3 bit fields in an 8-bit word. v1=2, v2=0. s=0.
  A = 11 010 000
  B = 00 000 000
  X = 00 101 111
  C = 00 001 001
  Y = 01 111 000
X&C = 00 001 001
Y&C = 01 001 000
  Z = 00 001 000 -> match detected :-)

Does that look right? I think know this puts the carry flag _exactly_ in
the bit above the field, or fields, where the match was, in all cases. Double
matches, too, are detected.

  A = 11 000 000
  B = 00 000 000
  X = 00 111 111
  C = 01 001 001
  Y = 10 001 000
X&C = 00 001 001
Y&C = 00 001 000
  Z = 01 001 000 -> double match detected :-)

I'll code them both, and see which is faster.

Of course, (X and C) xor (Y and C) xor C == (X^Y^C)&C

Quote:
> But if you really want high speed, use a table lookup!

I _do_ use a table lookup already. As I said in my previous posts,
table lookups are no longer faster once the lookups reach a certain
size. I've got thousands of runs with timings demonstrating this.
The bitmask/list cross-over point is a parameter to my code. Before
I run a long task I do a quick test to find where this cross-over
point is. Currently the cross-over is somewhere below 1000 bits. I do
know that I've got some optimisations to do in the bitfield lookup
section of the code, and I can imagine the cross-over migrating upwards
by a non-negligible factor. However, I know that my 'one int per entry'
list lookup is more in need of optimising currently, and I believe that
your technique above holds the key to that.

I think I discovered something a long while back that was similar, but
had no initial not, and used a subtract rather than an add. My guess
is thatthey're probably equivalent.
Thanks,
Phil



Thu, 17 Feb 2005 06:33:16 GMT  
 SIMD-like bit-twiddling on odd-sized quantities

Quote:

> Does that look right? I think know this puts the carry flag _exactly_ in
> the bit above the field, or fields, where the match was, in all cases. Double
> matches, too, are detected.

The case where Vi^S is 1....10 breaks the above, as it causes a
propagated carry that will flag a match in the next column up too.

However, as long as strict knowledge about which position isn't required
(and it isn't), that's not a problem.

Oh, the plain xor and subtract version (rather than xnor and add) does
seem to work in exactly the same way.

Phil



Thu, 17 Feb 2005 08:08:07 GMT  
 SIMD-like bit-twiddling on odd-sized quantities

Quote:


>> Does that look right? I think know this puts the carry flag _exactly_ in
>> the bit above the field, or fields, where the match was, in all cases. Double
>> matches, too, are detected.

> The case where Vi^S is 1....10 breaks the above, as it causes a
> propagated carry that will flag a match in the next column up too.

> However, as long as strict knowledge about which position isn't required
> (and it isn't), that's not a problem.

> Oh, the plain xor and subtract version (rather than xnor and add) does
> seem to work in exactly the same way.

Further - an explicit 0->1 in the top bit of each field during the
subtract indicates a carry, and therefore even the power-of-2 sized
fields which don't have room for an extra carry bit can be checked

precalculation:
A = packed list of values
U = packed repetition of 1
C = packed repetition of 2^(field size-1)

test:
B = packed repetition of search value
X = A^B
Y = X-U
V = (Y&~X)&C

interpretation:
V = { zero     on no matches
    { non-zero if one of more field matched

Does this sound about right?

Phil



Thu, 17 Feb 2005 22:14:31 GMT  
 SIMD-like bit-twiddling on odd-sized quantities

Quote:

> precalculation:
> A = packed list of values
> U = packed repetition of 1
> C = packed repetition of 2^(field size-1)

> test:
> B = packed repetition of search value
> X = A^B
> Y = X-U
> V = (Y&~X)&C

> interpretation:
> V = { zero     on no matches
>     { non-zero if one of more field matched

> Does this sound about right?

I think so. Your calculation is equivalent to

 (Xi >= 0) and ( (Xi - 1) < 0) which is true only for Xi = 0.



Fri, 18 Feb 2005 01:39:07 GMT  
 SIMD-like bit-twiddling on odd-sized quantities

Quote:


>> precalculation:
>> A = packed list of values
>> U = packed repetition of 1
>> C = packed repetition of 2^(field size-1)

>> test:
>> B = packed repetition of search value
>> X = A^B
>> Y = X-U
>> V = (Y&~X)&C

>> interpretation:
>> V = { zero     on no matches
>>     { non-zero if one of more field matched

>> Does this sound about right?

> I think so. Your calculation is equivalent to

>  (Xi >= 0) and ( (Xi - 1) < 0) which is true only for Xi = 0.

Thanks for verifying that, Herbert.

Carries only propagate noise into the next field up if there's already a
match, and therefore the interpretation is unchanged.

I should have come up with that initially (and I'm sure it is what I came
up with many years back and subsequently forgot) because on the Alpha
that I use as my main programming environment there is no carry flag, and
therefore if you want to detect carries, you have to perform the overflow
checking by hand similar to the above. (I believe this design decision was
to reduce overall chip complexity and inter-intruction dependencies, and
while it is a bit shocking when you first meet it after years of
Z80/68K/x86 etc. it does kinda make some sense in a twisted RISC-headed
way).

Cheerio,
Phil



Fri, 18 Feb 2005 01:53:57 GMT  
 
 [ 9 post ] 

 Relevant Pages 

1. Bit Twiddling??

2. Bit twiddling on Parallel port

3. UUDECODE - Bit-twiddling in Arexx

4. Data Path synthesis Verilog(controlling bit-widths with signed quantities)

5. New Mathematical bit-twiddle - stage on of a isPerfectSquare()

6. Data Path synthesis Verilog(controlling bit-widths with signed quantities)

7. Crazy bit-twiddle algorithms needed!

8. Bit twiddling functions

9. Bit twiddling functions

10. Byte/Bit Twiddling in Ada

11. Bit Twiddling in IBM (LE) COBOL

12. help - bit twiddling anomoly

 

 
Powered by phpBB® Forum Software