Crazy bit-twiddle algorithms needed! 
 Crazy bit-twiddle algorithms needed!

I'm trying to throw lots of bits into lots of buckets. It is more of an
'algorithmic' brainstorm I'm after rather than actual code.
Say I have a 130 bit bitstring (in memory) with arbitrary bits set. I
wish to then count the number of bits set in each of the N cyclic bit
positions.
e.g. if N=17, I want Sum[0] =
bit[0]+bit[17]+bit[34]+bit[51]+...+bit[119]
And of course I want all 17 sums.

I want to do this for a whole range of N, but values of N less than the
word size will be the most common.
The length of the bitstring will be constant for the runtime of the
code. I am prepared to rebuild my app for every run, but I'd rather have
a data-driven approach.
I am prepared to have a separate routine for every N (there will be
about 15 values used), but again I'd rather keep my L1 clean, and have a
simpler data driven approach. (though having say 3 versions of the code
for small/medium/large N would be great).

I think that having 17 buckets in the above is wasteful, as each sum can
only reach 7. So in theory I could do something like the following
(forgive the high level notation, I think better with symbols!)

Case N=19:

Input[0..4] holds the data

Mask3s = 01001001001001001001001001001001b
Maskto19 = 01111111111111111111b

Bucket0 = 0
Bucket1 = 0
Bucket2 = 0

; First word
Bucket0 |= Input[0]&Mask3s
Bucket1 |= (Input[0]>>1)&Mask3s
Bucket2 |= (Input[0]>>2)&Mask3s
; Now combine the top bits with the low bits
Bucket0 = (Bucket0&Maskto19)+(Bucket0>>19)
Bucket1 = (Bucket1&Maskto19)+(Bucket1>>19)
Bucket2 = (Bucket2&Maskto19)+(Bucket2>>19)

; second word
; Now what? I can't rotate, as the top bits don't belong below the
; bottom ones.
; So do I need to treat this as 3 chunks?
;  - the top bit of the 2nd 19 bits; the whole 3rd span of 19 bits;
;    the bottom of the 4th span?
; etc...

The other thing I thought of was trying to make 19-bits the basic
quantity, rather than the word width. So I'd step through the words more
slowly, but have simpler operations at each stage.

The other thing I thought of was the 'noddy' method.
Scan each bit. If it's set, put it in the appropriate bucket.
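In C, that would be something like this (untested sketch; word layout
assumed LSB-first within 32-bit words, names only for illustration):

#include <stdint.h>

/* 'Noddy' method: walk every bit, bump the bucket for its cyclic position. */
void count_noddy(const uint32_t *words, int nbits, int N, int *sum)
{
    int k = 0;
    for (int i = 0; i < N; i++)
        sum[i] = 0;
    for (int i = 0; i < nbits; i++) {
        if ((words[i >> 5] >> (i & 31)) & 1)
            sum[k]++;
        if (++k == N)       /* cyclic position, no divide needed */
            k = 0;
    }
}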

The _other_ thing I thought of was to have the numbers in a list rather
than as a bitstring. Then I do a mod on each number, and increment the
resulting bucket. I want to avoid DIVs, though, as they'll be slow on
the processor family I'll be using.

I'm sure there must be a 'sum bits in parallel' method, but the
multi-word nature increases the number of possibilities. What do people
advise?

Phil



Sat, 05 Jul 2003 22:14:47 GMT  
 Crazy bit-twiddle algorithms needed!

mh... I think that's fairly easy

input: esi = pointer to bit-stream
       ecx = number of bits to accumulate
       edx = distance between bits (=17 if you want to add every 17th
bit)
       edi = first bit to test

output: eax = number of set bits

  xor ebx, ebx
  xor eax, eax
next_bit:
  bt [esi], edi  ; copies bit no. (edi) of the bit-stream into the carry flag
  setc bl        ; sets bl=1 if carry set, otherwise 0
  add eax, ebx   ; accumulate into eax
  add edi, edx   ; next bit...
  dec ecx
  jnz next_bit

This assumes, of course, that you're working in a 32-bit flat memory
model... at least a 386 is required for the BT and SETcc instructions.

Christian



Sun, 06 Jul 2003 05:48:14 GMT  
 Crazy bit-twiddle algorithms needed!

I suspect I'd mask the large value with a mask containing ones in the
desired bit positions (this mask coming from a lookup table) and
then use the generic bit-counting algorithm.  The table lookup
is somewhat expensive, but I suspect it would be faster than
other simple alternatives.
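Something along these lines in C (untested sketch; the mask-table layout
and names are only for illustration, shown for N=17 over a 130-bit string
held in five dwords):

#include <stdint.h>

/* Generic parallel bit count for one 32-bit word. */
static int popcount32(uint32_t x)
{
    x = x - ((x >> 1) & 0x55555555u);
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;
    return (int)((x * 0x01010101u) >> 24);
}

/* masks[k][w] holds a 1 in every bit of word w whose overall bit index
   is congruent to k mod 17 (and is below 130); built once per N. */
void count_cyclic(const uint32_t input[5],
                  const uint32_t masks[17][5], int sum[17])
{
    for (int k = 0; k < 17; k++) {
        sum[k] = 0;
        for (int w = 0; w < 5; w++)
            sum[k] += popcount32(input[w] & masks[k][w]);
    }
}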
Randy Hyde


Quote:
> Say I have a 130 bit bitstring (in memory) with arbitrary bits set. I
> wish to then count the number of bits set in each of the N cyclic bit
> positions.
> e.g. if N=17, I want Sum[0] =
> bit[0]+bit[17]+bit[34]+bit[51]+...+bit[119]
> And of course I want all 17 sums.



Sun, 06 Jul 2003 05:48:19 GMT  
 Crazy bit-twiddle algorithms needed!

Quote:
> Say I have a 130 bit bitstring (in memory) with arbitrary bits set. I
> wish to then count the number of bits set in each of the N cyclic bit
> positions.
> e.g. if N=17, I want Sum[0] =
> bit[0]+bit[17]+bit[34]+bit[51]+...+bit[119]
> And of course I want all 17 sums.

> The other thing I thought of was the 'noddy' method.
> Scan each bit. If it's set, put it in the appropriate bucket.

> I'm sure there must be a 'sum bits in parallel' method, but the
> multi-word nature increases the number of possibilities. What do people
> advise?

Assuming you have to total each case (from 0 to N-1), it
seems the 'noddy' method, as you say, is the only practical
way to go.  I would set up a routine like this:

;esi = bit-string pointer
;ebx = bit bucket pointer
;ch  = number of dwords in bit string
;cl  = cycle count (N)

        mov    dl,cl      ;cycle count

        dec    ch         ;dword counter

        mov    dh,32      ;32 bits in dword
        lodsd             ;get dword in eax

        shr    eax,1      ;lowest bit into carry
        adc    byte ptr [edi],0  ;add to total
        inc    edi        ;next bit bucket
        dec    dl         ;cycle counter

        mov    dl,cl      ;reset cycle counter
        mov    edi,ebx    ;reset bit bucket pointer
        sub    dh,cl      ;reduce bit count by N

        add    dh,cl      ;restore deficit


        shr    eax,1      ;lowest bit into carry
        adc    byte ptr [edi],0  ;add to total
        inc    edi        ;next bit bucket
        dec    dh         ;dword done yet?


I haven't tested this.  Hope it helps.

--
Richard Pavlicek
Web site: http://www.rpbridge.net



Sun, 06 Jul 2003 07:42:47 GMT  
 Crazy bit-twiddle algorithms needed!

How about an algorithm where you take 17 bytes (or whatever the
maximum number will be under your current N-order) and shift the bits
from your source through the carry into each one in sequence (the
first bit into the first byte, the 2nd bit into the 2nd byte, 3rd into
the 3rd... 17th into the 17th, then 18th into the 1st, etc.)?

When you're finished you'll have 17 bytes, each one holding, packed
into its own bit positions, the bits gathered from the corresponding
positions of the source.  To find each sum, do a simple lookup into a
fixed table using the byte itself as the index; the result is the
number of set bits.

If your order falls below a certain threshold then your lookup will
have to be increased to accommodate.  At that point it might be a
trade off for storage space and a simple post-processing count-bit
algorithm. :)

Your lookup table would be such that:
00000000b = 0
00000001b = 1
00000010b = 1
00000011b = 2
00000100b = 1
00000101b = 2
00000110b = 2
00000111b = 3
00001000b = 1
..
..
..
11111100b = 6
11111101b = 7
11111110b = 7
11111111b = 8

    mov     esi,offset source_bits
    mov     ecx,total_num_of_bits
    mov     edi,offset output_area
    mov     edx,edi
    add     edx,17  ; Whatever order you're wanting
  top_loop:
    mov     ebx,8
  inner_loop:
    shl     byte ptr [esi],1        ; next source bit into the carry
    rcl     byte ptr [edi],1        ; ...and into the current bucket byte
    inc     edi
    cmp     edi,edx
    jb      still_going
    mov     edi,offset output_area  ; wrap back to the first bucket byte
  still_going:
    dec     ebx
    jz      next_part
    loop    inner_loop
    jmp     done                    ; ran out of source bits mid-byte

  next_part:
    inc     esi
    loop    top_loop
  done:

    mov     edi,offset output_area
  ; Right now the data at [edi] will represent
  ; seventeen values that can be applied to the
  ; lookup table above to determine their total

  ; Use an algorithm similar to this to extract the totals
    mov     ecx,17
  next_lookup:
    movzx   ebx,byte ptr [edi]
    movzx   eax,byte ptr lookup_table[ebx]
  ; Right now, eax contains the total of the bits for this element
  ; Display it, do whatever ...
    inc     edi
    loop    next_lookup

- Rick C. Hodgin



Sun, 06 Jul 2003 11:51:34 GMT  
 Crazy bit-twiddle algorithms needed!

Quote:

> mh... I think that's fairly easy

> input: esi = pointer to bit-stream
>        ecx = number of bits to accumulate
>        edx = distance between bits (=17 if you want to add every 17th
> bit)
>        edi = first bit to test

> output: eax = number of set bits

Hmmm, I did say:

Quote:
> > And of course I want all 17 sums.

You've assumed that the bit pitch is < register size. I think I forgot
to explicitly say that, but that is the case (I intend to port this to
my Alpha, where the pitch will reach 63)

Phil



Sun, 06 Jul 2003 21:11:54 GMT  
 Crazy bit-twiddle algorithms needed!

Quote:
> Hmmm, I did say:
> > > And of course I want all 17 sums.

Mh... apply the algorithm 17 times in a row :)

Quote:
> You've assumed that the bit pitch is < register size

not at all, should work with any given pitch (<2^32)...


Mon, 07 Jul 2003 02:36:01 GMT  
 Crazy bit-twiddle algorithms needed!

Quote:


> > mh... I think that's fairly easy

> > input: esi = pointer to bit-stream
> >        ecx = number of bits to accumulate
> >        edx = distance between bits (=17 if you want to add every 17th
> > bit)
> >        edi = first bit to test

> > output: eax = number of set bits

> Hmmm, I did say:
> > > And of course I want all 17 sums.

> You've assumed that the bit pitch is < register size. I think I forgot
> to explicitly say that, but that is the case (I intend to port this to
> my Alpha, where the pitch will reach 63)

I've had problems with my newsfeed, so I didn't see the original
request, but I'll assume I understood everything correctly from the
context:

This is an interesting problem, I think you could do all sums in
parallel with bitslice arithmetic:

First split the input into blocks of N slices of the bit pitch length (N
= 2^n-1), then add all together using binary full adders:

I.e. starting with 15*17 = 255 bits, you can compress this down to 4
17-bit registers.

A full adder takes three input bits (a, b and carry_in) and generates
two result bits (sum and carry_out).

The logic is simple:

  t = a ^ b;
  sum = t ^ carry_in;
  carry_out = (a & b) | (t & carry_in);

Doing this you can convert three input slices to two, seven slices to
three and 15 slices to 4.
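In C, that full adder applied across whole slice words is just (untested
sketch; the slice type and names are only for illustration):

#include <stdint.h>

typedef uint32_t slice;   /* one bit per cyclic position, e.g. low 17 bits used */

/* Full adder over whole slices: compresses three slices (a, b, cin)
   into two, sum (weight 1) and cout (weight 2), per bit position. */
static void full_add(slice a, slice b, slice cin, slice *sum, slice *cout)
{
    slice t = a ^ b;
    *sum  = t ^ cin;
    *cout = (a & b) | (t & cin);
}

Feeding three raw input slices through this gives one weight-1 and one
weight-2 slice; repeating it on the partial results is what gets seven
slices down to three and 15 down to 4.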

If the pitch is less or equal to half the register width, then we'll
simply repeat it as many times as will fit, and do the final
accumulation at the end.

This process could be repeated until it covered the full input array,
but it is probably much faster to keep the number of temporary variables
at something that will fit in the selected processor.

On x86 I'd use 15 slices which will fit inside the 8 64-bit MMX regs, on
an Alpha you could probably get away with up to 20-24 accumulator regs,
corresponding to 1 to 16 Mbit arrays, but it would be better to make the
input size divisible by the register size, to simplify the alignment
code.

Terje

--

Using self-discipline, see http://www.eiffel.com/discipline
"almost all programming can be viewed as an exercise in caching"



Mon, 07 Jul 2003 02:36:06 GMT  
 Crazy bit-twiddle algorithms needed!

Quote:

> I've had problems with my newsfeed, so I didn't see the original
> request, but I'll assume I understood everything correctly from the
> context:

> This is an interesting problem, I think you could do all sums in
> parallel with bitslice arithmetic:

> First split the input into blocks of N slices of the bit pitch length (N
> = 2^n-1), then add all together using binary full adders:

> I.e. starting with 15*17 = 255 bits, you can compress this down to 4
> 17-bit registers.

I'm glad you chose that example - the most likely problems I will want
to tackle are summing
up to 192 bits into between 17 and 59 buckets (all pitches are prime
numbers), so we're in a similar ballpark.

If I understand correctly, then you're accumulating the counts
'perpendicularly' across several registers.
This certainly classifies as a 'crazy bit-twiddle', in my book,
congratulations!
I have prior knowledge that the bitstring will be nowhere near full, so
the maximum of 15 in your example will never be attained, and 7 is more
than enough. That decreases the register requirement by 1.

I do need to do some comparison arithmetic on the sums too, I need to
classify them into 3 cases:
sum<=1, sum==2, sum>=3.
Assuming the LSBs are in register R0, then
sum>=3 <=> R2 | (R1&R0)
sum==2 <=> R1 & ~(R0|R2)
sum<=1 <=> ~(R1|R2)
That's fairly neat too.
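A quick brute-force check of those three tests in C (untested, with
sum==2 taken as R1 & ~(R0|R2)):

#include <stdio.h>

int main(void)
{
    /* r0, r1, r2 are the three bit-planes of the per-position count,
       LSB first, as in the R0/R1/R2 registers above. */
    for (int count = 0; count <= 7; count++) {
        int r0 = (count >> 0) & 1;
        int r1 = (count >> 1) & 1;
        int r2 = (count >> 2) & 1;
        int ge3 = (r2 | (r1 & r0)) & 1;
        int eq2 = (r1 & ~(r0 | r2)) & 1;
        int le1 = (~(r1 | r2)) & 1;
        printf("count %d: >=3 %d  ==2 %d  <=1 %d\n", count, ge3, eq2, le1);
    }
    return 0;
}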

Quote:
> A full adder takes three input bits (a, b and carry_in) and generates
> two result bits (sum and carry_out).

> The logic is simple:

>   t = a ^ b;
>   sum = t ^ carry_in;
>   carry_out = (a & b) | (t & carry_in);

This is 3 => 2. OK, I have no problem with that.

Quote:
> Doing this you can convert three input slices to two, seven slices to
> three and 15 slices to 4.

I don't see directly how to go from 7 to 3.
I can grok 9=>6=>4 as two 3=>2s, but a fast 7 to 3 evades me (It's
probably a standard technique).
The 7=>3 truth table doesn't appear to be as simple as the 3=>2 one, so
I'm floundering.

Quote:
> If the pitch is less or equal to half the register width, then we'll
> simply repeat it as many times as will fit, and do the final
> accumulation at the end.

Yes, this is a double win, as you do fewer 'sums' and each one
accumulates fewer bits.

Quote:
> This process could be repeated until it covered the full input array,
> but it is probably much faster to keep the number of temporary variables
> at something that will fit in the selected processor.

With an O(n!) algorithm (I know, but maths is sometimes like that, at
least it's NP)
across spans that seem to be O(n.log(p)^2) I'll never get a chance to
tackle spans larger than 500 bits (n=30, and I'm currently looking at
n=20 cases, with spans of 200 bits).

Quote:
> On x86 I'd use 15 slices which will fit inside the 8 64-bit MMX regs, on
> an Alpha you could probably get away with up to 20-24 accumulator regs,
> corresponding to 1 to 16 Mbit arrays, but it would be better to make the
> input size divisible by the register size, to simplify the alignment
> code.

There's no register problem on the Alpha certainly (and I rarely code in
assembly on that, I write C and verify that the output is what I would
have written anyway!). I've never programmed MMX code; is it supported
in recent versions of g++/gas (I'm on Linux)?

Phil



Tue, 08 Jul 2003 02:37:26 GMT  
 Crazy bit-twiddle algorithms needed!

Quote:


> > I've had problems with my newsfeed, so I didn't see the original
> > request, but I'll assume I understood everything correctly from the
> > context:

> > This is an interesting problem, I think you could do all sums in
> > parallel with bitslice arithmetic:

> > First split the input into blocks of N slices of the bit pitch length (N
> > = 2^n-1), then add all together using binary full adders:

> > I.e. starting with 15*17 = 255 bits, you can compress this down to 4
> > 17-bit registers.

> I'm glad you chose that example - the most likely problems I will want
> to tackle are summing
> up to 192 bits into between 17 and 59 buckets (all pitches are prime
> numbers), so we're in a similar ballpark.

> If I understand correctly, then you're accumulating the counts
> 'perpendicularly' across several registers.

Yes.

Quote:
> This certainly classifies as a 'crazy bit-twiddle', in my book,
> congratulations!

:-)

Quote:
> I have prior knowledge that the bitstring will be nowhere near full, so
> the maximum of 15 in your example will never be attained, and 7 is more
> than enough. That decreases the register requirement by 1.

This helps a lot!


Quote:

> I do need to do some comparison arithmetic on the sums too, I need to
> classify them into 3 cases:
> sum<=1, sum==2, sum>=3.
> Assuming the LSBs are in register R0, then
> sum>=3 <=> R2 | (R1&R0)
> sum==2 <=> R1 & ~(R0|R2)
> sum<=1 <=> ~(R1|R2)
> That's fairly neat too.

> > A full adder takes three input bits (a, b and carry_in) and generates
> > two result bits (sum and carry_out).

> > The logic is simple:

> >   t = a ^ b;
> >   sum = t ^ carry_in;
> >   carry_out = (a & b) | (t & carry_in);

> This is 3 => 2. OK, I have no problem with that.

> > Doing this you can convert three input slices to two, seven slices to
> > three and 15 slices to 4.

> I don't see directly how to go from 7 to 3.
> I can grok 9=>6=>4 as two 3=>2s, but a fast 7 to 3 evades me (It's
> probably a standard technique).
> The 7=>3 truth table doesn't appear to be as simple as the 3=>2 one, so
> I'm floundering.

You don't generate a direct 7->3 table, instead you do multiple full
adds:

Assume we have a macro with three inputs and 2 outputs:

  (carry_out, sum) = full_add(a, b, carry_in)

The inputs are named as (a, b, c, d, e, f, g), so the code becomes:

  (s1, s0) = full_add(a, b, c);
  (t1, t0) = full_add(d, e, f);
  (u1, sum0) = full_add(s0, t0, g);
  (sum2, sum1) = full_add(s1, t1, u1);

I.e. you need 4 full adds to convert 7 bits to 3, and you'll need 11 to
go from 15 bits to 4.

The pattern is pretty obvious: You need one full_add for each bit
compression.

With a maximum count of 7, and no need to classify more than (>= 3), you
could simply do a few full_adds, and then switch to a different
approach, where you work with carry_save numbers, propagating each new
input block in parallel, using half adders:

 carry_out = a & b;
 sum = a ^ b;

This requires 4 accumulator regs:

 sum0, sum1, sum2 and carry1:

The code to propagate an input array looks like this:

 sum2 |= sum1 & carry1;             // Saturating adds!
 sum1 ^= carry1;
 carry1 = sum0 & input;
 sum0 ^= input;
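Pulling that together in C (untested sketch; the slice values are made
up, and extracting the 17-bit slices from the packed bit string is left
out):

#include <stdint.h>
#include <stdio.h>

#define PITCH 17   /* number of cyclic bit positions */

int main(void)
{
    /* One 17-bit slice of the input per element, LSB = position 0.
       (Extracting these from the packed 130-bit string is a separate step.) */
    uint32_t slices[] = { 0x1F00F, 0x00011, 0x10101, 0x0F0F0,
                          0x00001, 0x1FFFF, 0x00100, 0x00003 };
    int nslices = (int)(sizeof slices / sizeof slices[0]);

    uint32_t sum0 = 0, sum1 = 0, sum2 = 0, carry1 = 0;

    for (int i = 0; i < nslices; i++) {
        uint32_t input = slices[i] & ((1u << PITCH) - 1);
        sum2  |= sum1 & carry1;      /* saturating add into the 4s plane */
        sum1  ^= carry1;
        carry1 = sum0 & input;
        sum0  ^= input;
    }

    /* Count per position = sum0 + 2*(sum1 + carry1) + 4*sum2; exact as
       long as no position is hit more than 7 times (the stated assumption). */
    for (int pos = 0; pos < PITCH; pos++) {
        int count = ((sum0 >> pos) & 1)
                  + 2 * (int)(((sum1 >> pos) & 1) + ((carry1 >> pos) & 1))
                  + 4 * ((sum2 >> pos) & 1);
        printf("position %2d: %d\n", pos, count);
    }
    return 0;
}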

Quote:
> > On x86 I'd use 15 slices which will fit inside the 8 64-bit MMX regs, on
> > an Alpha you could probably get away with up to 20-24 accumulator regs,
> > corresponding to 1 to 16 Mbit arrays, but it would be better to make the
> > input size divisible by the register size, to simplify the alignment
> > code.

> There's no register problem on the Alpha certainly (and I rarely code in
> assembly on that, I write C and verify that the output is what I would
> have written anyway!). I've never programmed MMX code, is it supported
> in recent versions of g++/gas (I'm on Linux)?

NASM should handle it, that's all you need!

Terje

--

Using self-discipline, see http://www.eiffel.com/discipline
"almost all programming can be viewed as an exercise in caching"



Tue, 08 Jul 2003 08:55:29 GMT  
 Crazy bit-twiddle algorithms needed!

Quote:
>> > First split the input into blocks of N slices of the bit pitch length (N
>> > = 2^n-1), then add all together using binary full adders:

>> > I.e. starting with 15*17 = 255 bits, you can compress this down to 4
>> > 17-bit registers.

Why add?  If you take the bits as they come in and build a sequence of
bit patterns (buckets he calls them) then you can use those bit
patterns against a lookup table and you'll have your summations.  The
example I posted the other day does this and it works for up to 8-bit
patterns.

- Rick C. Hodgin



Tue, 08 Jul 2003 12:52:27 GMT  
 Crazy bit-twiddle algorithms needed!

Quote:

> >> > First split the input into blocks of N slices of the bit pitch length (N
> >> > = 2^n-1), then add all together using binary full adders:

> >> > I.e. starting with 15*17 = 255 bits, you can compress this down to 4
> >> > 17-bit registers.

> Why add?  If you take the bits as they come in and build a sequence of
> bit patterns (buckets he calls them) then you can use those bit
> patterns against a lookup table and you'll have your summations.  The
> example I posted the other day does this and it works for up to 8-bit
> patterns.

Rick, as I stated I never got the initial post, so I do not have the
complete background. There are obviously many different ways to handle
this problem, depending upon the actual input data.

With a guaranteed max of 7 for each bit position, it might simply be
most efficient to have 3 accumulator regs, and then split each input
into 3 blocks:

 sum0 += input & 0o11111111; // Octal constant!
 sum1 += input & 0o02222222;
 sum2 += input & 0o04444444;

This can handle up to 30 bits in each input block; to get all the way to
32 would require a shift operation on the last two lines.

Since all bit pitches would be primes, 32 is impossible, so a single
shift is enough to handle the case of 31:

 sum0 += input & 0o11111111;
 sum1 += input & 0o02222222;
 sum2 += (input>>2) & 0o11111111;

This results in 9 basic two-operand opcodes for each new input block.
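An untested C sketch of those packed 3-bit counters (the input blocks
here are made-up 24-bit values; for a real pitch you'd mask the blocks
to the pitch width):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* Packed 3-bit counters: counters for bit positions 0,3,6,... live
       in sum0, positions 1,4,7,... in sum1, positions 2,5,8,... in sum2.
       Safe as long as no position is hit more than 7 times. */
    const uint32_t mask0 = 011111111;   /* octal: bits 0,3,6,...,21 */
    const uint32_t mask1 = 022222222;   /* octal: bits 1,4,7,...,22 */
    const uint32_t mask2 = 044444444;   /* octal: bits 2,5,8,...,23 */

    uint32_t blocks[] = { 0xBEEF17, 0x00FF00, 0x133337 };  /* made-up input */
    uint32_t sum0 = 0, sum1 = 0, sum2 = 0;

    for (unsigned i = 0; i < sizeof blocks / sizeof blocks[0]; i++) {
        sum0 += blocks[i] & mask0;
        sum1 += blocks[i] & mask1;
        sum2 += blocks[i] & mask2;
    }

    /* Read back the per-position counts: the counter for position p has
       its LSB at bit p of the register that handles p's residue mod 3. */
    for (int p = 0; p < 24; p++) {
        uint32_t reg = (p % 3 == 0) ? sum0 : (p % 3 == 1) ? sum1 : sum2;
        printf("position %2d: %d\n", p, (int)((reg >> p) & 7));
    }
    return 0;
}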

Using my half adders from yesterday:

 sum2 |= sum1 & carry1;         // Saturating adds!
 sum1 ^= carry1;
 carry1 = sum0 & input;
 sum0 ^= input;

I get just 7 ops for each new input block, while at the same time
automatically handling any overflows in the individual counts.

Terje

--

Using self-discipline, see http://www.eiffel.com/discipline
"almost all programming can be viewed as an exercise in caching"



Tue, 08 Jul 2003 22:13:25 GMT  
 Crazy bit-twiddle algorithms needed!

Quote:
>With a guaranteed max of 7 for each bit position, it might simply be
>most efficient to have 3 accumulator regs, and then split each input
>into 3 blocks:
> sum0 += input & 0o11111111; // Octal constant!
> sum1 += input & 0o02222222;
> sum2 += input & 0o04444444;

I either don't understand your solution (more likely) or I don't
understand the original problem.

The original poster said he had a fixed number of bits in a source
string and he needed to sum every Nth bit, such as each 17th bit being
equal to string[0] + string[17] + string[37].... + string[whatever the
particular iteration would still fit within the maximum length of the
original string].

How does your solution do that?

My solution is to cycle through the source string in an iterative
manner building a series of output bytes/words/dwords (depending on
maximum output size) containing the Nth order bit patterns as they
originally appeared in the source string.  Once that series is built
each element can be applied to a simple 256-byte lookup table which
would provide you the sums.  One lookup would be required for 8-bit or
less quantities, 2 lookups and one add would be required for 16-bit to
9-bit quantities, 3 lookups and two adds would be required for 24-bit
to 17-bit operations, etc.  Alternatively, a larger lookup table
could be provided to reduce the number of adds.
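For example, in C the lookup step might look like this (untested sketch;
the table name is only for illustration):

#include <stdint.h>

/* 256-entry table: bit_count_table[b] = number of set bits in byte b.
   Call build_table() once at startup (or use fixed data instead). */
static uint8_t bit_count_table[256];

static void build_table(void)
{
    for (int b = 0; b < 256; b++) {
        int n = 0;
        for (int i = 0; i < 8; i++)
            n += (b >> i) & 1;
        bit_count_table[b] = (uint8_t)n;
    }
}

/* Three lookups and two adds cover a 17-bit to 24-bit bucket pattern. */
static int count_bits24(uint32_t pattern)
{
    return bit_count_table[pattern & 0xFF]
         + bit_count_table[(pattern >> 8) & 0xFF]
         + bit_count_table[(pattern >> 16) & 0xFF];
}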

- Rick C. Hodgin



Wed, 09 Jul 2003 01:37:29 GMT  
 Crazy bit-twiddle algorithms needed!

Quote:
> I either don't understand your solution (more likely) or I don't
> understand the original problem.

> My solution is to cycle through the source string in an iterative
> manner building a series of output bytes/words/dwords (depending on
> maximum output size) containing the Nth order bit patterns as they
> originally appeared in the source string.  Once that series is built
> each element can be applied to a simple 256-byte lookup table which
> would provide you the sums.

OK, I understand!

You are basically compressing the bit string, extracting only the wanted
bits, and then counting the number of set bits in the result. Right?

My code is quite different: I calculate all 17 sums for a pitch-17
calculation in parallel.

I.e. first extract each 17-bit slice, align into the bottom part of a
register, and then accumulate each bit position simultaneously:

  input = get17bits();
  sum2 |= sum1 & carry1;
  sum1 ^= carry1;
  carry1 = sum0 & input;
  sum0 ^= input;

If you want to avoid the explicit storage of carry, then the code gets
some more serial latency:

  input = get17bits();
  carry = sum0 & input;
  sum0 ^= input;
  sum2 |= sum1 & carry;
  sum1 ^= carry;

Terje

--

Using self-discipline, see http://www.eiffel.com/discipline
"almost all programming can be viewed as an exercise in caching"



Wed, 09 Jul 2003 05:17:02 GMT  
 Crazy bit-twiddle algorithms needed!

Quote:
>string[0] + string[17] + string[37]....

That should be [34], not [37].  Typo.  Same finger, different hand. :)

- Rick C. Hodgin



Wed, 09 Jul 2003 05:17:07 GMT  
 