jumpless conditional subtraction. 
Author Message
 jumpless conditional subtraction.

As this is a.l.a I'll prefix with a "I'm on an Athlon".

I've got 2 64-bit (never above 2^40, for reference) integers,
and in C, I'm doing:

   /* multiply tpow (in 1..p-1) by 2, modulo p */
   tpow<<=1;
   if(tpow>p) { tpow-=p; }

Now the compiler turns the shift into a shl and a shld, which I don't have
an issue with. However, the if statement is rendered using cmps and jumps,
which, as the value is utterly unpredictable, causes me to take a bit of a
performance hit.

What is the suggested 64-bit branchless bit-twiddle to do the above?
I know the 32-bit equivalent is some funky combination of subs, sbbs, ands,
and other stuff, does it convert to 64-bits? Is there a cmovcc version even?

Phil



Sat, 02 Jul 2005 10:49:10 GMT  
 jumpless conditional subtraction.

Quote:

> As this is a.l.a I'll prefix with a "I'm on an Athlon".

> I've got 2 64-bit (never above 2^40, for reference) integers,
> and in C, I'm doing:

>    /* multiply tpow (in 1..p-1) by 2, modulo p */
>    tpow<<=1;
>    if(tpow>p) { tpow-=p; }

> Now the compiler turns the shift into a shl and a shld, which I don't have
> an issue with. However, the if statement is rendered using cmps and jumps,
> which, as the value is utterly unpredictable, causes me to take a bit of a
> performance hit.

> What is the suggested 64-bit branchless bit-twiddle to do the above?
> I know the 32-bit equivalent is some funky combination of subs, sbbs,
ands,
> and other stuff, does it convert to 64-bits? Is there a cmovcc version

even?

Shouldn't this stuff be in 'comp.lang.c' instead of alt.lang.asm?

Ross.



Sat, 02 Jul 2005 13:14:21 GMT  
 jumpless conditional subtraction.


Quote:
> As this is a.l.a I'll prefix with a "I'm on an Athlon".

> I've got 2 64-bit (never above 2^40, for reference) integers,
> and in C, I'm doing:

>    /* multiply tpow (in 1..p-1) by 2, modulo p */
>    tpow<<=1;
>    if(tpow>p) { tpow-=p; }

> Now the compiler turns the shift into a shl and a shld, which I don't have
> an issue with. However, the if statement is rendered using cmps and jumps,
> which, as the value is utterly unpredictable, causes me to take a bit of a
> performance hit.

Perhaps instead of a condition, you could use something like:

        tpow -= p & -(tpow>p);

Though the compiler might use a couple of cmps and a jump to implement
the comparison of 64-bit items.  If so, you can probably improve it a
bit by using sub and sbb instead of cmp.  Since you only need all bits
either set or clear in the result, you can also produce just a 32-bit
result, and and that 32-bit item with both halves of the 64-bit operand
to get the 0 or p that you're going to subtract from tpow.

--
    Later,
    Jerry.

The universe is a figment of its own imagination.



Sat, 02 Jul 2005 15:11:03 GMT  
 jumpless conditional subtraction.
Try this:

    push ebp
    mov ebp,esp
    sub esp,8
    mov eax,[tpow]            ; get low dword of tpow
    mov edx,[tpow+4]          ; get high dword of tpow
    shl eax, 1                ; places high order bit into carry flag
    rcl edx, 1                ; places carry flag into low order bit
    mov [tpow+0],eax          ; store low dword of tpow
    mov [tpow+4],edx          ; store high dword of tpow
    sub eax,[p]               ; compare low dword
    sbb edx,[p+4]             ; compare high dword
    mov eax,[p+0]             ; reload low dword of tpow
    mov edx,[p+4]             ; reload high dword of tpow
    cmova [tpow+0],eax        ; if (tpow > p) get low dword of p
    cmova [tpow+4],edx        ; if (tpow > p) get high dword of p
    leave


Quote:
> As this is a.l.a I'll prefix with a "I'm on an Athlon".

> I've got 2 64-bit (never above 2^40, for reference) integers,
> and in C, I'm doing:

>    /* multiply tpow (in 1..p-1) by 2, modulo p */
>    tpow<<=1;
>    if(tpow>p) { tpow-=p; }

> Now the compiler turns the shift into a shl and a shld, which I don't have
> an issue with. However, the if statement is rendered using cmps and jumps,
> which, as the value is utterly unpredictable, causes me to take a bit of a
> performance hit.

> What is the suggested 64-bit branchless bit-twiddle to do the above?
> I know the 32-bit equivalent is some funky combination of subs, sbbs,
ands,
> and other stuff, does it convert to 64-bits? Is there a cmovcc version
even?

> Phil



Sat, 02 Jul 2005 14:02:59 GMT  
 jumpless conditional subtraction.


Quote:
> As this is a.l.a I'll prefix with a "I'm on an Athlon".

> I've got 2 64-bit (never above 2^40, for reference) integers,
> and in C, I'm doing:

>    /* multiply tpow (in 1..p-1) by 2, modulo p */
>    tpow<<=1;
>    if(tpow>p) { tpow-=p; }

> Now the compiler turns the shift into a shl and a shld, which I don't have
> an issue with. However, the if statement is rendered using cmps and jumps,
> which, as the value is utterly unpredictable, causes me to take a bit of a
> performance hit.

> What is the suggested 64-bit branchless bit-twiddle to do the above?

Well, it sounds like you want something like:

    tpow <<= 1;
    tpow -= p & -(tpow > p);

Now, so how do we do -(tpow > p) in a totally predicate manner?  Lets see this
is the same as -(p - tpow < 0), so we want to perform a 64 bit subtract of tpow
from p and just capture the carry, then use it as the mask and it against p
then subtract from tpow.

    mov esi, tpow[0]
    mov edi, tpow[4]
    add esi, esi
    adc edi, edi ; tpow << 1; -- Don't use SHLD.

    mov eax, p[0]
    mov edx, p[4]
    sub eax, esi
    sbb edx, edi ;   p - tpow
    sbb ebx, ebx ; -(p - tpow < 0)

    mov eax, negp[0]
    mov edx, negp[4]
    and eax, ebx
    and edx, ebx ; p & -(p - tpow < 0)

    sub esi, eax
    sbb edi, edx ; tpow - (p & -(p - tpow < 0))

    mov tpow[0], esi
    mov tpow[4], edi ; tpow -= p & -(p - tpow < 0)

Voila!  Ok, I cmovCC should be even easier.  Your code is equivalent to:

    tpow <<= 1;
    tpow = (tpow - p >= 0) ? tpow - p : tpow;

So we will just compute tpow - p, capture the CF flag, then conditionally
overwrite tpow

    mov esi, tpow[0]
    mov edi, tpow[4]
    add esi, esi
    adc edi, edi ; tpow << 1; -- Don't use SHLD.

    mov eax, esi
    mov edx, edi ; Annoying dependency.

    sub eax, p[0]
    sbb edx, p[4] ; tpow - p

    cmovnb esi, eax
    cmovnb edi, edx ; (tpow - p >= 0) ? tpow - p : tpow

    mov tpow[0], esi
    mov tpow[4], edi ; tpow = (tpow - p >= 0) ? tpow - p : tpow

--
Paul Hsieh
http://www.pobox.com/~qed/



Sat, 02 Jul 2005 18:47:03 GMT  
 jumpless conditional subtraction.

Quote:
>    /* multiply tpow (in 1..p-1) by 2, modulo p */
>    tpow<<=1;
>    if(tpow>p) { tpow-=p; }

> What is the suggested 64-bit branchless bit-twiddle to do the above?

Since you're dealing with > 32-bit operands and the 386+ is only a 32-bit
operand computer, you're going to have to have a branch somewhere after
comparing portions of the upper 8-bits if it's greater.  If you do not then
the generated code will complex enough that the 12+ cycle hit you're taking
right now due to branch will seem small.

What is the generated code for Athlon?

Quote:
> I know the 32-bit equivalent is some funky combination of subs, sbbs,
ands,
> and other stuff, does it convert to 64-bits? Is there a cmovcc version

even?

The problem isn't getting the bits setup for the cmovcc you mention.  The
problem is traversing through the required code necessary to test all
conditions without a branch in such a way that doesn't harm any code or have
any undesired side-effects.

I will work on this today however.  I find it to be an interesting
challenge.

- Rick C. Hodgin



Sat, 02 Jul 2005 21:24:10 GMT  
 jumpless conditional subtraction.

Quote:


>> As this is a.l.a I'll prefix with a "I'm on an Athlon".

>> I've got 2 64-bit (never above 2^40, for reference) integers,
>> and in C, I'm doing:

>>    /* multiply tpow (in 1..p-1) by 2, modulo p */
>>    tpow<<=1;
>>    if(tpow>p) { tpow-=p; }

>> Now the compiler turns the shift into a shl and a shld, which I don't have
>> an issue with. However, the if statement is rendered using cmps and jumps,
>> which, as the value is utterly unpredictable, causes me to take a bit of a
>> performance hit.

>> What is the suggested 64-bit branchless bit-twiddle to do the above?
>> I know the 32-bit equivalent is some funky combination of subs, sbbs,
> ands,
>> and other stuff, does it convert to 64-bits? Is there a cmovcc version
> even?

> Shouldn't this stuff be in 'comp.lang.c' instead of alt.lang.asm?

No. I want to know how to do better than the compiler.
The most common way to do better than the C compiler is to use asm, isn't
it?

Quite how does one do a "sbb" or a "cmovcc" in C? Please enlighten me.

Phil



Sun, 03 Jul 2005 05:44:31 GMT  
 jumpless conditional subtraction.


Quote:

> > Shouldn't this stuff be in 'comp.lang.c' instead of alt.lang.asm?
> No. I want to know how to do better than the compiler.
> The most common way to do better than the C compiler is to use asm, isn't
> it?

Not neccassarily.  A good modern C/C++ compiler can out do
some of the most seasoned asm coders.  :-)

Quote:
> Quite how does one do a "sbb" or a "cmovcc" in C? Please enlighten me.

Depends on the Compiler.

_asm {
  sbb eax,0

Quote:
}

or

__asm sbb eax,0

or

#asm
  sbb eax,0
#endasm

Many others.  Depends on the compiler and the
platform you plan to compile for.

Cheers,
Ben

--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Forever Young Software
http://www.{*filter*}trails.com/~fys/index.htm
To reply by email, please remove the zzzzzz's

Batteries not included,
Some assembly required.



Sun, 03 Jul 2005 06:06:25 GMT  
 jumpless conditional subtraction.

Quote:



>> As this is a.l.a I'll prefix with a "I'm on an Athlon".

>> I've got 2 64-bit (never above 2^40, for reference) integers,
>> and in C, I'm doing:

>>    /* multiply tpow (in 1..p-1) by 2, modulo p */
>>    tpow<<=1;
>>    if(tpow>p) { tpow-=p; }

[SNIP - evolution of technique]

Quote:
>     tpow <<= 1;
>     tpow = (tpow - p >= 0) ? tpow - p : tpow;

> So we will just compute tpow - p, capture the CF flag, then conditionally
> overwrite tpow

>     mov esi, tpow[0]
>     mov edi, tpow[4]
>     add esi, esi
>     adc edi, edi ; tpow << 1; -- Don't use SHLD.

>     mov eax, esi
>     mov edx, edi ; Annoying dependency.

>     sub eax, p[0]
>     sbb edx, p[4] ; tpow - p

>     cmovnb esi, eax
>     cmovnb edi, edx ; (tpow - p >= 0) ? tpow - p : tpow

>     mov tpow[0], esi
>     mov tpow[4], edi ; tpow = (tpow - p >= 0) ? tpow - p : tpow

Now that is truly elegant. You've pared it down to exactly what is needed,
and nothing which isn't.

Ben, apart from the shl/rcl vs add/adc has the same instruction count, and
I think was aiming for the same solution.

There's an interesting analogue problem, that of dividing by 2. In this
case, the adjustment that you make is dependent on the very bottom bit of
tpow

  tpow = (tpow + ((tpow&1) ? p : 0)) >> 1 ;

In this case you don't need to perform 2 operations in order to arrive at
the 'I need to add p' decision, as it's all decided from that bottom bit.
-(tpow&1) is 0 (tpow even) or ~0 (tpow odd), and so I think my first stab
would be the following, which to be honest looks rubbish!

    mov esi, tpow[0]

    mov eax, esi

    and eax, 1

    neg eax          ; all 1s if tpow was odd, else all 0s

    mov ebx, eax

    and eax, p[0]   ; p or 0
    and ebx, p[1]   ;
    add eax, esi
    adc ebx, tpow[1]
    shr eax, 1
    rcr ebx, 1      ; shrd posible, if done before the eax shr
    mov tpow[0], eax
    mov tpow[1], ebx

It's evident that the dependencies at the start are crap.
Let's start again...

    mov    esi, tpow[0]
    xor    ebx, ebx
    test   esi, 1        ; test oddness of tpow
    mov    eax, ebx
    cmovnz eax, p[0]    
    cmovnz ebx, p[1]     ; ebx:eax = odd ? p : 0
    add    eax, esi
    adc    ebx, tpow[1]  ; ebx:eax = tpow + (odd?p:0)
    shr    eax, 1
    rcr    ebx, 1        ; shrd posible, if done before the eax shr
    mov    tpow[0], eax
    mov    tpow[1], ebx

That looks better, but no doubt someone can do better.

I've not decided whether I shall do the modular *=2 or the /=2 version yet,
I just tested /=2 in a 32-bit version, and the compiler makes /=2 far better
than *=2. However, as this is at the innermost loop, I'm definitely looking
at asm-ifying it for x86.

Thanks for your input,

Phil



Sun, 03 Jul 2005 06:40:01 GMT  
 jumpless conditional subtraction.

Quote:

>> No. I want to know how to do better than the compiler. The most common
>> way to do better than the C compiler is to use asm, isn't it?

> Not neccassarily.  A good modern C/C++ compiler can out do some of the
> most seasoned asm coders.  :-)

On the right architecture, sometimes. I write alpha assembly in C because
I know that gcc will turn one C statement into one asm instruction. I've
never needed 27 Int registers and 31 FP registers, so I don't feel
restricted by this method. Gcc does the same thing with the same code on
the Power4 architecture, which means that I've got instruction by
instruction control on at least 2 architectures. I shall be compiling my
latest code on R12000 and SparcIII machines this week, and I'll see if gcc
does an equivalent job threon too.

x86, however, I find it very hard to agree with you. I've seen some
amazing wizardry by others on x86 machines, and I'm almost always
dissatisfied by what gcc produces on x86. Intel's icc only seems to be
margially better, but I always inspect the generated code, and I always
ask myself "why is it so stoopid".

Quote:
>> Quite how does one do a "sbb" or a "cmovcc" in C? Please enlighten me.

> Depends on the Compiler.

> _asm {
>   sbb eax,0
> }

Not C.

Quote:
> or

> __asm sbb eax,0

Not C.

Quote:
> or

> #asm
>   sbb eax,0
> #endasm

Not C.

Quote:
> Many others.  Depends on the compiler and the platform you plan to
> compile for.

But that's not C,; I'd call it (non-standard) inline assembly. It
(well, the standard version) is the techinque I intend to use to get asm
into the essential parts of my C code where gcc has done such a poor job.

Phil



Sun, 03 Jul 2005 07:02:04 GMT  
 jumpless conditional subtraction.

Quote:

> > Shouldn't this stuff be in 'comp.lang.c' instead of alt.lang.asm?

> No. I want to know how to do better than the compiler.
> The most common way to do better than the C compiler is to use asm, isn't
> it?

Ah okay, Sorry!

Quote:
> Quite how does one do a "sbb" or a "cmovcc" in C? Please enlighten me.

Inline Assembly?

Regards,
Ross.



Sun, 03 Jul 2005 07:19:29 GMT  
 jumpless conditional subtraction.


Hi Phil,

Quote:
> >> Quite how does one do a "sbb" or a "cmovcc" in C? Please enlighten me.

> > Depends on the Compiler.

> > _asm {
> >   sbb eax,0
> > }

> Not C.

> > or

> > __asm sbb eax,0

> Not C.

> > or

> > #asm
> >   sbb eax,0
> > #endasm

> Not C.

> > Many others.  Depends on the compiler and the platform you plan to
> > compile for.

> But that's not C,; I'd call it (non-standard) inline assembly. It
> (well, the standard version) is the techinque I intend to use to get asm
> into the essential parts of my C code where gcc has done such a poor job.

Uhhmm.  Don't take this offensive by any means.  I don't mean
it offensive, but you couldn't have asked a more general question.
Why expect a detailed answer with such a general question.

 "Quite how does one do a "sbb" or a "cmovcc" in C?"

In my opinion, placing assembly instructions, (inline, emit, whatever)
is no longer C in the first place.  How can I put "assembly language"
instructions in a "C language" source file and still be C?

Maybe I am just not following you here.

Cheers :-),
Ben

--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Forever Young Software
http://www.{*filter*}trails.com/~fys/index.htm
To reply by email, please remove the zzzzzz's

Batteries not included,
Some assembly required.



Sun, 03 Jul 2005 12:38:36 GMT  
 jumpless conditional subtraction.

Quote:

> Uhhmm.  Don't take this offensive by any means.  I don't mean
> it offensive, but you couldn't have asked a more general question.
> Why expect a detailed answer with such a general question.

>  "Quite how does one do a "sbb" or a "cmovcc" in C?"

I didn't expect a detailed answer. I expected "inline assembly" and nothing
more. Hence my choice of newsgroup. Details are what I had hoped for in my
post previous to that, and fortunately I got them!

Phil



Sun, 03 Jul 2005 16:33:12 GMT  
 jumpless conditional subtraction.
OK, so the code is for tpow = min(tpow << 1,p).
It seems that you do want a proper MOD, so remove
the following two lines:

Quote:
>     mov eax,[p+0]             ; reload low dword of tpow
>     mov edx,[p+4]             ; reload high dword of tpow

and replace:

Quote:
>     cmova [tpow+0],eax        ; if (tpow > p) get low dword of p
>     cmova [tpow+4],edx        ; if (tpow > p) get high dword of p

with:

Quote:
>     cmovnb [tpow+0],eax        ; if (tpow > p) get low dword of p
>     cmovnb [tpow+4],edx        ; if (tpow > p) get high dword of p

also, remove the stack frame (we don't need it):

Quote:
>     push ebp
>     mov ebp,esp
>     sub esp,8
>     leave

this makes:

Quote:
>     mov eax,[tpow]            ; get low dword of tpow
>     mov edx,[tpow+4]          ; get high dword of tpow
>     shl eax, 1                ; places high order bit into carry flag
>     rcl edx, 1                ; places carry flag into low order bit
>     mov [tpow+0],eax          ; store low dword of tpow
>     mov [tpow+4],edx          ; store high dword of tpow
>     sub eax,[p]               ; compare low dword
>     sbb edx,[p+4]             ; compare high dword
>     cmovnb [tpow+0],eax        ; if (tpow > p) get low dword of p
>     cmovnb [tpow+4],edx        ; if (tpow > p) get high dword of p

This is no faster or slower than Paul Hsieh's solution. In fact, it's almost
exactly the same.


Sun, 03 Jul 2005 08:56:15 GMT  
 
 [ 14 post ] 

 Relevant Pages 

1. Double word subtraction

2. 5.2e memvar subtraction problem

3. LOGO-L> Subtraction survey

4. LOGO-L> Subtraction survey results

5. Is object subtraction possible?

6. Subtraction in VHDL (Help!!!!)

7. 32bit x 32 bit add and subtraction

8. Addition and subtraction operators and std_logic_vector types

9. synthesize subtraction

10. 1's complement addition, subtraction help

11. 8-bit Subtraction

12. Two's complement signed binary addition and subtraction

 

 
Powered by phpBB® Forum Software