SIMDlike bittwiddling on oddsized quantities
Author 
Message 
FatPh #1 / 9

SIMDlike bittwiddling on oddsized quantities
I'm trying to perform a whole load of comparisons (equal/notequal, magnitude not important) in parallel, and here's the pseudocode I have at the moment: For example, when dealing with 9bit quantities, 6 at a time e.g. if I'm looking for the values 2,3,5,7,11,13 Firstly I create masks maskc = 0000100000000010000000001000000000100000000010000000001000000000 mask1 = 0000000000000100000000010000000001000000000100000000010000000001 which are the locations of the '1' bit, and of the 'carry' bits. Then for the values 2 3 5 7 11 13 I form the 64bit word vals = 0000000000001000000000110000000101000000011100000010110000001101 OK, that's just preprocessing, placing the important values 10 bits apart, and doesn't need to be fast, it's only done once (for about 30000 values). The rest happens a zillion and two times: If the value I wish to test for inclusion is x, I'll form (((x*mask1) XOR vals)  mask1) AND maskc and test against 0. This expression starts by placing x in the 6 locations. The xor leaves 0 or non0 in each of the 6 locations. The subtract leaves a ve value (carry) or a nonnegative value (no carry) in the 6 locations The and checks each of the carry flags. If one of the values matches x, then one of the carry flags will be set. OK, that's what I've got currently, and it looks fine. _However_ I could get a 14% speedup instantly if I could squeeze _7_ 9bit values into the 64bit word rather than 6. I'm also fighting cache, so making things smaller would be an added benefit. Can anyone see a scheme that doesn't require the extra bit overhead for the carry? I'll not be using just 9bit values, but will have other places where I have 10, 11, 12, 13, 14, 15, and maybe 16bit values. As my normal news server appears dead currently, <<< bash2.05b$ ping news.cis.dfn.de PING news.cis.dfn.de (130.133.1.4): 56 data bytes  news.cis.dfn.de ping statistics  1742 packets transmitted, 0 packets received, 100% packet loss
I'd _really_ appreciate it if a CC: could be made by email, as I find google's lag is painful. Host name 'asdf.org' and username 'fatphil' work, what google have as my address doesn't, thanks to altavista being $%^&'s. Cheers, Phil

Tue, 15 Feb 2005 20:56:03 GMT 


Herbert Kleebaue #2 / 9

SIMDlike bittwiddling on oddsized quantities
Quote:
> I'm trying to perform a whole load of comparisons (equal/notequal, > magnitude not important) in parallel, and here's the pseudocode I have > at the moment: > For example, when dealing with 9bit quantities, 6 at a time > e.g. if I'm looking for the values 2,3,5,7,11,13
But for 9bit values you only need a table of 512 bit (64 byte) for a look up table. Then it takes only one processor instruction for the test. Quote: > I'll not be using just 9bit values, but will have other places where > I have 10, 11, 12, 13, 14, 15, and maybe 16bit values.
Even for a 16 bit value you can do it with a table (8 kbyte). The following program will read a 9 bit number from the keyboard and exit only if the number is 2,3,5,7,11,13,384 or 511.
00000100: e8 0048 _10: bsr.w get_number ; read 9bit number into r1 00000103: 0f a3 16 010b btst.w r1,table ; table look up 00000108: 73 f6 bcc.b _10 ; loop if number not found 0000010a: c3 rts.w ; set: 2,3,5,7,11,13,384,511 0000010b: 000028ac table: dc.l %00000000000000000010100010101100 ; 00310000 0000010f: 00000000 dc.l %00000000000000000000000000000000 ; 00630032 00000113: 00000000 dc.l %00000000000000000000000000000000 ; 00950064 00000117: 00000000 dc.l %00000000000000000000000000000000 ; 01270096 0000011b: 00000000 dc.l %00000000000000000000000000000000 ; 01590128 0000011f: 00000000 dc.l %00000000000000000000000000000000 ; 01910160 00000123: 00000000 dc.l %00000000000000000000000000000000 ; 02230192 00000127: 00000000 dc.l %00000000000000000000000000000000 ; 02550224 0000012b: 00000000 dc.l %00000000000000000000000000000000 ; 02870256 0000012f: 00000000 dc.l %00000000000000000000000000000000 ; 03190288 00000133: 00000000 dc.l %00000000000000000000000000000000 ; 03510320 00000137: 00000000 dc.l %00000000000000000000000000000000 ; 03830352 0000013b: 00000001 dc.l %00000000000000000000000000000001 ; 04150384 0000013f: 00000000 dc.l %00000000000000000000000000000000 ; 04470416 00000143: 00000000 dc.l %00000000000000000000000000000000 ; 04790448 00000147: 80000000 dc.l %10000000000000000000000000000000 ; 05110480 get_number: 0000014b: b8 0e0d _30: move.w #$0e0d,r0 0000014e: cd 10 trap #$10 00000150: b8 0e0a move.w #$0e0a,r0 00000153: cd 10 trap #$10 00000155: 66 31 d2 eor.l r1,r1 00000158: 31 c0 _20: eor.w r0,r0 0000015a: cd 16 trap #$16 0000015c: 3c 0d cmp.b #13,r0 0000015e: 74 22 beq.b _10 00000160: 3c 30 cmp.b #'0',r0 00000162: 72 f4 blo.b _20 00000164: 3c 39 cmp.b #'9',r0 00000166: 77 f0 bhi.b _20 00000168: 66 0f b6 c8 movu.bl r0,r2 0000016c: 80 e9 30 sub.b #'0',r2 0000016f: b4 0e move.b #$0e,m0 00000171: cd 10 trap #$10 00000173: 66 d1 e2 lsl.l #1,r1 00000176: 66 01 d1 add.l r1,r2 00000179: 66 c1 e2 02 lsl.l #2,r1 0000017d: 66 01 ca add.l r2,r1 00000180: eb d6 br.b _20 00000182: 66 81 fa 000001ff _10: cmp.l #%111111111,r1 00000189: 77 c0 bhi.b _30 0000018b: c3 rts.w The binary:
test.com del test.com

Wed, 16 Feb 2005 01:45:58 GMT 


Phil Carmod #3 / 9

SIMDlike bittwiddling on oddsized quantities
Quote:
>> I'm trying to perform a whole load of comparisons (equal/notequal, >> magnitude not important) in parallel, and here's the pseudocode I have >> at the moment: >> For example, when dealing with 9bit quantities, 6 at a time >> e.g. if I'm looking for the values 2,3,5,7,11,13 > But for 9bit values you only need a table of 512 bit (64 byte) > for a look up table. Then it takes only one processor instruction > for the test.
+180 or more wasted instruction slots to get the datum into the cache. The reason I'm using the technique I am is because these tables have now become too large for sensible use. The crossover does appear to be somewhere below the 1024bit level, but depends on several factors, so I want to implement a 9bit version in as well, just in case. Quote: >> I'll not be using just 9bit values, but will have other places where >> I have 10, 11, 12, 13, 14, 15, and maybe 16bit values. > Even for a 16 bit value you can do it with a table (8 kbyte).
Which totally totally kills my cache. I have about 2000 of these arrays, and I nead access to most of them most of the time. With 46 values crammed into a single 64bit word, and on average about 10 values per table, the bitpacking method requires 32KB, and 2 bittwiddling comparisons, rather than several megabytes and 'one processor instruction'. Quote: > The following program will read a 9 bit number from the keyboard > and exit only if the number is 2,3,5,7,11,13,384 or 511.
Thanks, but it is the answer to a different question. Terje Matheson's .sig is relevant here... Phil

Wed, 16 Feb 2005 07:40:32 GMT 


Herbert Kleebaue #4 / 9

SIMDlike bittwiddling on oddsized quantities
Quote:
> >> I'll not be using just 9bit values, but will have other places where > >> I have 10, 11, 12, 13, 14, 15, and maybe 16bit values. > > Even for a 16 bit value you can do it with a table (8 kbyte). > Which totally totally kills my cache.
If this is really a problem, set the memory area for the tables non cacheable. If you don't want to use the complete table, why not use at least a hash function. The simplest way is to use two bitpacked arrays, on for the even and one for the odd values. This way you normally have only to search half of the number of values (even or odd table) and each stored value is one bit shorter. But back to your original question. The following should work (untested): vi: stored values s : search value ++++/.../++ A=  v1  v2  v3   vn  ++++/.../++ ++++/.../++ B=  s  s  s   s  ++++/.../++ ++++/.../++ C=  0...01  0...01  0...01   0...01  ++++/.../++ X = not (A xor B) Y = X add C if CARRY > found (test only necessary if bit length = 2,4,8,16,... Z = (X and C) xor (Y and C) xor C if Z not equal 0 > found But if you really want high speed, use a table lookup!

Thu, 17 Feb 2005 01:14:55 GMT 


Phil Carmod #5 / 9

SIMDlike bittwiddling on oddsized quantities
Quote:
>> >> I'll not be using just 9bit values, but will have other places where >> >> I have 10, 11, 12, 13, 14, 15, and maybe 16bit values. >> > Even for a 16 bit value you can do it with a table (8 kbyte). >> Which totally totally kills my cache. > If this is really a problem, set the memory area for the tables non > cacheable.
But these are the things I want in my cache. All of them. My problem is that when I have 3 MB of them, they tend to start pushing each other out of the cache. Quote: > If you don't want to use the complete table, why not use at least a > hash function. The simplest way is to use two bitpacked arrays, on for > the even and one for the odd values. This way you normally have only to > search half of the number of values (even or odd table) and each stored > value is one bit shorter.
That's certainly an idea  it's applicable to the array lookup setup too. I'll bear it in mind for that context later on. Quote: > But back to your original question. The following should work (untested): > vi: stored values > s : search value > ++++/.../++ > A=  v1  v2  v3   vn  > ++++/.../++ > ++++/.../++ > B=  s  s  s   s  > ++++/.../++ > ++++/.../++ > C=  0...01  0...01  0...01   0...01  > ++++/.../++ > X = not (A xor B) > Y = X add C > if CARRY > found (test only necessary if bit length = 2,4,8,16,... > Z = (X and C) xor (Y and C) xor C > if Z not equal 0 > found
OK, that seems to let the carry flag be part of the next field above. Therefore one can't rely on its boolean value to see if a carry occured, but instead look for a change in its value, which will be concealed by the mere act of adding 1, such that the carry flag staying the same is the indicating feature. e.g. 1) 3 bit fields in an 8bit word. v1=2, v2=0. s=0. A = 00 010 000 B = 00 000 000 X = 11 101 111 C = 00 001 001 Y = 11 111 000 X&C = 00 001 001 Y&C = 00 001 000 Z = 00 001 000 > match :) e.g. 2) 3 bit fields in an 8bit word. v1=0, v2=2. s=0. A = 00 000 010 B = 00 000 000 X = 11 111 101 C = 00 001 001 Y = 01 000 110 X&C = 00 001 001 Y&C = 00 000 000 Z = 00 000 000 > no match detected :( SO there's a slight problem, but it's fixable. The carry detect stage at the top bit should be performed for all packing methods, not just the powers of two (where the carry takes place in an external carry bit (which incidentally I don't have on my processor, but I'll not be needing 16bit values for a while)). However, I think that if I dope the 'unused' part of A with nonzeroes, I can simply pretend there's an extra partial field (which in B will be zeroes). e.g. 3) 3 bit fields in an 8bit word. v1=0, v2=2. s=0. A = 11 000 010 B = 00 000 000 X = 00 111 101 C = 01 001 001 (extra carry bit) Y = 10 000 110 X&C = 00 001 001 Y&C = 00 000 000 Z = 01 000 000 > match detected :) and checking to with the previous example. e.g. 4) 3 bit fields in an 8bit word. v1=2, v2=0. s=0. A = 11 010 000 B = 00 000 000 X = 00 101 111 C = 00 001 001 Y = 01 111 000 X&C = 00 001 001 Y&C = 01 001 000 Z = 00 001 000 > match detected :) Does that look right? I think know this puts the carry flag _exactly_ in the bit above the field, or fields, where the match was, in all cases. Double matches, too, are detected. A = 11 000 000 B = 00 000 000 X = 00 111 111 C = 01 001 001 Y = 10 001 000 X&C = 00 001 001 Y&C = 00 001 000 Z = 01 001 000 > double match detected :) I'll code them both, and see which is faster. Of course, (X and C) xor (Y and C) xor C == (X^Y^C)&C Quote: > But if you really want high speed, use a table lookup!
I _do_ use a table lookup already. As I said in my previous posts, table lookups are no longer faster once the lookups reach a certain size. I've got thousands of runs with timings demonstrating this. The bitmask/list crossover point is a parameter to my code. Before I run a long task I do a quick test to find where this crossover point is. Currently the crossover is somewhere below 1000 bits. I do know that I've got some optimisations to do in the bitfield lookup section of the code, and I can imagine the crossover migrating upwards by a nonnegligible factor. However, I know that my 'one int per entry' list lookup is more in need of optimising currently, and I believe that your technique above holds the key to that. I think I discovered something a long while back that was similar, but had no initial not, and used a subtract rather than an add. My guess is thatthey're probably equivalent. Thanks, Phil

Thu, 17 Feb 2005 06:33:16 GMT 


Phil Carmod #6 / 9

SIMDlike bittwiddling on oddsized quantities
Quote:
> Does that look right? I think know this puts the carry flag _exactly_ in > the bit above the field, or fields, where the match was, in all cases. Double > matches, too, are detected.
The case where Vi^S is 1....10 breaks the above, as it causes a propagated carry that will flag a match in the next column up too. However, as long as strict knowledge about which position isn't required (and it isn't), that's not a problem. Oh, the plain xor and subtract version (rather than xnor and add) does seem to work in exactly the same way. Phil

Thu, 17 Feb 2005 08:08:07 GMT 


Phil Carmod #7 / 9

SIMDlike bittwiddling on oddsized quantities
Quote:
>> Does that look right? I think know this puts the carry flag _exactly_ in >> the bit above the field, or fields, where the match was, in all cases. Double >> matches, too, are detected. > The case where Vi^S is 1....10 breaks the above, as it causes a > propagated carry that will flag a match in the next column up too. > However, as long as strict knowledge about which position isn't required > (and it isn't), that's not a problem. > Oh, the plain xor and subtract version (rather than xnor and add) does > seem to work in exactly the same way.
Further  an explicit 0>1 in the top bit of each field during the subtract indicates a carry, and therefore even the powerof2 sized fields which don't have room for an extra carry bit can be checked precalculation: A = packed list of values U = packed repetition of 1 C = packed repetition of 2^(field size1) test: B = packed repetition of search value X = A^B Y = XU V = (Y&~X)&C interpretation: V = { zero on no matches { nonzero if one of more field matched Does this sound about right? Phil

Thu, 17 Feb 2005 22:14:31 GMT 


Herbert Kleebaue #8 / 9

SIMDlike bittwiddling on oddsized quantities
Quote:
> precalculation: > A = packed list of values > U = packed repetition of 1 > C = packed repetition of 2^(field size1) > test: > B = packed repetition of search value > X = A^B > Y = XU > V = (Y&~X)&C > interpretation: > V = { zero on no matches > { nonzero if one of more field matched > Does this sound about right?
I think so. Your calculation is equivalent to (Xi >= 0) and ( (Xi  1) < 0) which is true only for Xi = 0.

Fri, 18 Feb 2005 01:39:07 GMT 


Phil Carmod #9 / 9

SIMDlike bittwiddling on oddsized quantities
Quote:
>> precalculation: >> A = packed list of values >> U = packed repetition of 1 >> C = packed repetition of 2^(field size1) >> test: >> B = packed repetition of search value >> X = A^B >> Y = XU >> V = (Y&~X)&C >> interpretation: >> V = { zero on no matches >> { nonzero if one of more field matched >> Does this sound about right? > I think so. Your calculation is equivalent to > (Xi >= 0) and ( (Xi  1) < 0) which is true only for Xi = 0.
Thanks for verifying that, Herbert. Carries only propagate noise into the next field up if there's already a match, and therefore the interpretation is unchanged. I should have come up with that initially (and I'm sure it is what I came up with many years back and subsequently forgot) because on the Alpha that I use as my main programming environment there is no carry flag, and therefore if you want to detect carries, you have to perform the overflow checking by hand similar to the above. (I believe this design decision was to reduce overall chip complexity and interintruction dependencies, and while it is a bit shocking when you first meet it after years of Z80/68K/x86 etc. it does kinda make some sense in a twisted RISCheaded way). Cheerio, Phil

Fri, 18 Feb 2005 01:53:57 GMT 


