Eliminating Branch Instructions in Assembly 
Author Message
 Eliminating Branch Instructions in Assembly

I have discussed optimization of assembly in last two
mails -- "Optimized "DOCOL" in Win32For on Pentium"
and "Some Reordered Assembly for Win32For".
This mail is about eliminating branch instructions.
Branch instructions may cause slow codes.

First I give an example in "Fkernel.f"

CODE 0>         ( n1 -- f1 )    \ return true if n1 is greater than zero
        cmp     ebx, # 0

                mov             ebx, # -1

This definition is not bad but uses two branch instructions.
I explain how Pentium and Pentium Pro execute these lines.
When they decode the first two instructions, they will
find the second is a branch instruction.  Instead of waiting
the branch taken or not, they predict that the branch will
be taken or not.  If their prediction is right, only 1 or 2 cycles
are consumed; if not, Pentium will need extra 3 cycles to
execute the right sequence of instructions, but Pentium Pro
will need extra 10 CYCLES AT LEAST !!!  If they almost
predict rightly, the definition is not bad.  But the first branch
is taken if f1 is false, and it is not taken if f1 is true.  That is,
the processors must predict whether f1 is true or not.  I think
that it is hard to predict in most occasion, since programmers
do not take care of this.  So the definition is not good enough.
Moreover, the processors have an internal buffer called BTB
for storing the branch target addresses.  Pentium's BTB is
the size of only 256 branches, and Pentium Pro's the size of
only 512 branches.  Hence programmers had better not use
branch instructions.  I wrote a verson of "0>" without any
branch instructions.

CODE 0>         ( n1 -- f1 )
\ *S eliminate jump instructions
\ return true if n1 is greater than zero
                dec     ebx
                cmp             ebx, # 0x7FFFFFFF
                sbb             ebx, ebx
                next            c;

It consumes only 8 cycles on Pentium.  If I reoder it,
it will consume only 6 cycles.  ( see the two mails. )
Here is another example.

CODE <          ( n1 n2 -- f1 ) \ return true if n1 is less than n2
                pop             eax
                cmp             eax, ebx

                mov             ebx, # -1

First I eliminate a branch.

CODE <          ( n1 n2 -- f1 )
\ *S eliminate a jump instruction
\ return true if n1 is less than n2
                pop     eax
                cmp             eax, ebx
                mov             ebx, # 0

                dec             ebx

Now I eliminate all branches.

CODE <        ( n1 n2 -- f1 )
\ *S eliminate ALL jump instructions !!
\ not bad on Pentium, not best on Pentium Pro
\ return true if n1 is less than n2
                pop             eax
                cmp             eax, ebx
                sbb             ecx, ecx
                xor             ebx, eax
                sar             ebx, # 31
                xor             ebx, ecx
                next            c;

It consumes more cycles to execute the "sbb" instruction
on Pentium Pro.

\ CODE <        ( n1 n2 -- f1 )
\ *S eliminate ALL jump instructions !!
\ worse on Pentium, best on Pentium Pro
\ return true if n1 is less than n2
        pop             eax
                xor             ecx, ecx
                cmp             eax, ebx
                setl            cl
                neg             ecx
                mov             ebx, ecx
                next            c;

It consumes more cycles to exeute the "setl" instruction
on Pentium.

I think if we use assembly, we must take care of branch
instructions.  Do not use them as possible as we can.
If we have to use them, we should use them carefully.
As we ignore this, we may get somewhat slow codes.

Surprised?  I will continue the topic about branch
instructions.  TO BE CONTINUED.

                        Charles Liu



Sat, 13 Nov 1999 03:00:00 GMT  
 Eliminating Branch Instructions in Assembly

I agree with Mr. Liu. It is generally much better to compute
rather than to decide. The longer the instruction queue or pre-
fetch pipeline, the more likely it is that decisionless programming
will improve operating speed.

HOWEVER:
Some processors (I think the Power PC chipset which was, if
my aging memory does not fail me, based on the IBM RISC-6000
design) have double pre-fetch queues: one for each branch.
On such processors the penalty for jumps and branches is
small, and the maintainability of the code is better (because
it is clearer--no clever hacks are needed) so one might
as well keep the branches.

Some of this was discussed in my article "Avoid Decisions"
which appeared in Computers in Physics, 1991.

ADDITIONAL REMARKS:
One of the best hacks for programming such things as
comparison operators uses the sign-extend instructions,
which I did not see exploited by Mr. Liu.

That is, if one does a CMP or a SUB, thereby setting the
sign flag, one can extend this flag through an 8-, 16-
or 32-bit register (I believe the mnemonic is CSB for
8-bits and CSW for 16- or 32--or maybe it's CDB and CDW).
If the flag was 0 then the register contains 0 bits, if it was
1 the register contains all 1's. Then one does some logical
arithmetic and voila! one leaves the appropriate flag in
the appropriate place. The words ABS, MAX and MIN are
easily CODEd using this technique, which definitely saves
some cycles.

FINALLY:
Depending on the Forth linkage mechanism, there might be enough
calling overhead that little is gained from CODE optimization at the
(short) word level. Systems with in-line code optimizers or those that
generate native code as they compile will of course be expected
to gain much more from such techniques than vanilla high-level
Forths with indirect threading.

It would be interesting if Mr. Liu gave us some timing examples
for typical programs with- and without optimizations of the various
comparison operators, and specifying the cpu and clock speed.

--
Julian V. Noble



Sun, 14 Nov 1999 03:00:00 GMT  
 Eliminating Branch Instructions in Assembly

...

Quote:
> HOWEVER:
> Some processors (I think the Power PC chipset which was, if
> my aging memory does not fail me, based on the IBM RISC-6000
> design) have double pre-fetch queues: one for each branch.
> On such processors the penalty for jumps and branches is
> small, and the maintainability of the code is better (because
> it is clearer--no clever hacks are needed) so one might
> as well keep the branches.

It's been a while since I took the PowerPC class, but I think that there
is some branch prediction, and a penalty for guessing wrong.

IBM published a book called "The PowerPC Compiler Writer's Guide which
has an appendix showing some branchless optimal sequences for some
common things like 0<, 0=, >, < etc.

Note the title. Compiler Writers Guide. This speaks to the point made by
several people---- Optimize when it matters.

...

Quote:
> ADDITIONAL REMARKS:
> One of the best hacks for programming such things as
> comparison operators uses the sign-extend instructions,
...
> The words ABS, MAX and MIN are
> easily CODEd using this technique,

The words are CBW and CWD (Convert-Byte-toWord, Word to Doubleword)
This works pretty well for 0< as well. Just do CWD to the reg that has
the number to be tested, and the answer is now in the high order word.

However, the PowerPC does not have a sign extension instruction.
But you can do a 31 bit arithmetic right shift. 0< in one instruction!
(I confess, I got that from the book :)

Randy Leberknight
Motorola Computer Group
Tempe, AZ.

(602) 438-3275



Sun, 14 Nov 1999 03:00:00 GMT  
 
 [ 3 post ] 

 Relevant Pages 

1. Multi-branch instructions and unique values (Version 2.2 preview)

2. QUESTION: 486 instruction prefetch and branch/jump

3. branch never instruction

4. QUESTION: 486 instruction prefetch and branch/jump

5. Branch frequencies (was Assembly or ....)

6. Branch frequencies (was Assembly or ....)

7. Mono / Multi instructions assembly lines

8. Wanted instruction set for intel 80386/80486/8087/80487/pentium assembly set

9. Assembly FPU instructions

10. Please help with some assembly instructions

11. Inline Assembly: The RET Instruction

12. List of x86 assembly instructions

 

 
Powered by phpBB® Forum Software