Optimized "DOCOL" in Win32For on Pentium 
Author Message
 Optimized "DOCOL" in Win32For on Pentium

I use Win32For on Pentium now.  I found "DOCOL" in "Fkernel.f"
not good enough.  Because Pentium is a superscalar processor
and has AGI as i486.  So we must program Pentium very carefully.
First I explain what AGI is ( Address Generation Interlock ).
Look at the following two lines, which is extracted from "Fkernel.f"

sub     ebp, # 4                \ 1 cycle
mov     [ebp], esi              \ 1 cycle + 1 cycle AGI stall
                        \ total 3 cycle on both i486 and Pentium

Because the first line "sub     ebp, # 4" writes ebp, which is used
by the second line "mov     [ebp], esi" to generate address, AGI
is generated and cause 1 cycle stall.  The former two lines should be
replaced with the following:

mov     -4 [ebp], esi   \ U pipeline
sub     ebp, # 4                \ V pipeline
                        \ total 1 cycle on Pentium, 2 cycles on i486

I explain what U-V pipline is.  Pentium has two integer execution unit
called U-V pipeline and is able to execute two simple instructions
simultaneously.  The first is U and the second V; some instructions
must be executed in U other in V.  Unfortunately, it is unable to arrange
instructions to fit the U-V pipeline.  So programmers have to program it
very carefully.  Now I explain the latter two lines.  Pentium execute them
in U-V pipeline parallelly.  First, in U pipeline, it fetches ebp added
with -4 to generate destination address and esi as source; in V pipeline,
it fetches ebp and then subtracts 4 from ebp.  Later, in U pipeline,
it stores esi to memory; in V pipeline, it writes ebp with the result.  
Hence, it can execute the two instructions simultaneously and consumes
only 1 cycle without AGI.  The latter is faster and not longer.
( [ebp] is a short form of 0 [ebp] )

I analyse the original assembly for "DOCOL".

LABEL DOCOL     ( -- )  \ runtime for colon definitions
                mov     -4 [ebp], esi   \ U 1
                lea     esi, 8 [eax] [edi]      \ V    
                sub     ebp, # 4                \ U 1
                mov     eax, -4 [esi]   \ V 1 AGI caused by lea esi,...
                mov     ecx, 0 [eax] [edi]      \ 2 AGI caused by mov eax,...
                add     ecx, edi                \ 1
                jmp     ecx             \ 2
                c;                      \ total 8 cycles on Pentium

I skip the non-optimized version.  The two lines "mov ecx, 0 [eax] [edi]"
and "add ecx, edi" cannot be executed in parallel because "mov ecx,..."
writes ecx and then "add ecx,..." adds ecx with edi.  Since the first
instruction writes ecx, which the second reads, Pentium must execute
the first and then the second and cannot execute them in parallel.  
This case is called "read after write".  The former version of "DOCOL"
can be replaced with the following:

LABEL DOCOL     ( -- )  \ runtime for colon definitions
                mov     -4 [ebp], esi   \ U 1
                lea     esi, 8 [eax] [edi]      \ V    
                mov     eax, 4 [eax] [edi]      \ U 1    
        sub     ebp, # 4                \ V
                mov     ecx, 0 [eax] [edi]      \ 2 AGI caused by mov eax,...
                add     ecx, edi                \ 1
                jmp     ecx             \ 2
                c;                      \ total 7 cycles on Pentium

The new version eliminates an AGI and runs 1 cycle faster.
Believe it or not.  I can write a version of "DOCOL", which consumes
only 6 cycles, and eliminate all AGIs.

LABEL DOCOL     ( -- )  \ runtime for colon definitions
                mov     ecx, 4 [eax] [edi]      \ U 1    
        mov     -4 [ebp], esi   \ V
                lea     esi, 8 [eax] [edi]      \ U 1  
                mov     eax, ecx                \ V
        mov     ecx, 0 [ecx] [edi]      \ U 1
        sub     ebp, # 4                \ V
                add     ecx, edi                \ 1
                jmp     ecx             \ 2
                c;                      \ total 6 cycles on Pentium but 1 cycle
                                \ slower than the second version
                                \ on i486 because of one more instruction

I load ecx in advance and add an instruction "mov eax, ecx".
So I can eliminate all AGIs and speed up the code.
Surprised!!  I have other optimized codes for running on Pentium.
I will post to this newsgroup.  TO BE CONTINUED.

                                Charles Liu



Wed, 10 Nov 1999 03:00:00 GMT  
 Optimized "DOCOL" in Win32For on Pentium


Quote:
> I use Win32For on Pentium now.  I found "DOCOL" in "Fkernel.f"
> not good enough.  Because Pentium is a superscalar processor
> and has AGI as i486.  So we must program Pentium very carefully.
> First I explain what AGI is ( Address Generation Interlock ).
> Look at the following two lines, which is extracted from "Fkernel.f"
> sub     ebp, # 4        \ 1 cycle
> mov     [ebp], esi      \ 1 cycle + 1 cycle AGI stall
>                         \ total 3 cycle on both i486 and Pentium
> Because the first line "sub     ebp, # 4" writes ebp, which is used
> by the second line "mov     [ebp], esi" to generate address, AGI
> is generated and cause 1 cycle stall.  The former two lines should be
> replaced with the following:
> mov     -4 [ebp], esi   \ U pipeline
> sub     ebp, # 4        \ V pipeline
>                         \ total 1 cycle on Pentium, 2 cycles on i486

Charles,

I really like reading posts like this, especially because you have taken the
time to explain why the idea works. However, this particular sequence has
shown up three times in the past month :-)

I'm still not ready to use it though, because of what could happen if an
interrupt occurs between the mov and the sub, or when I'd want to use a
timer to switch Forth tasks? Some of my own code switches ebp and esp
sometimes to push and pop to the Forth return stack (so ebp points to the
machine stack for a while). Your trick combined with an interrupt or
task-switch might produce amusing results. Or is it guaranteed that no
interrupt will be serviced in certain states of the U - V pipelines? Does
the Pentium auto-switch stacks when an interrupt occurs?

[..]

Quote:
> I load ecx in advance and add an instruction "mov eax, ecx".
> So I can eliminate all AGIs and speed up the code.
> Surprised!!  I have other optimized codes for running on Pentium.
> I will post to this newsgroup.  TO BE CONTINUED.

That will be interesting.  

Did you benchmark your improved code against Win32Forth standard edition?
Your efforths would have a many times larger influence in a subroutine-
threaded system like, for instance, bigForth.

-marcel



Wed, 10 Nov 1999 03:00:00 GMT  
 Optimized "DOCOL" in Win32For on Pentium

Quote:

> I'm still not ready to use it though, because of what could happen if an
> interrupt occurs between the mov and the sub, or when I'd want to use a
> timer to switch Forth tasks? Some of my own code switches ebp and esp
> sometimes to push and pop to the Forth return stack (so ebp points to the
> machine stack for a while). Your trick combined with an interrupt or
> task-switch might produce amusing results. Or is it guaranteed that no
> interrupt will be serviced in certain states of the U - V pipelines? Does
> the Pentium auto-switch stacks when an interrupt occurs?

All of the above. First, no asynchronous interrupt would happen between
U-V pipelines. But as synchronous interrups (page faults, trace
exceptions) may break up between U and V pipeline, we should have a
closer look: The Pentium also changes the CPU stack on exceptions. So
there's only one case when you need to care: E.g. you have a tracer that
single steps all instructions. Since it's a Forth tracer, it works on
the same privelege level and doesn't change the stack. You must take
care then!

That's how I take care: Never trace in between a macro. The code part
that really checks this uses the current stack (either data or return
stack), and the current stack always is changed with push and pop (the
stack pointer is correct, then). BTW: There's no AGI stall if you push
and pop. You can even pair pushs and pops (in any order) (execute two of
them per cycle). I wish Intel gave us at least two stacks.

Quote:
> Did you benchmark your improved code against Win32Forth standard edition?
> Your efforths would have a many times larger influence in a subroutine-
> threaded system like, for instance, bigForth.

It's used since the very beginning of bigFORTH-386, since even the 386er
has an AGI stall (2 cycles, as almost anything takes <i486>*2 cycles).
And it works, even if you trace >R (that's an example of this coding
style).

--
Bernd Paysan
"Late answers are wrong answers!"
http://www.informatik.tu-muenchen.de/~paysan/



Thu, 11 Nov 1999 03:00:00 GMT  
 
 [ 3 post ] 

 Relevant Pages 

1. "Can't optimize #repeat" warning

2. string.join(["Tk 4.2p2", "Python 1.4", "Win32", "free"], "for")

3. Pentium "Timing" problem

4. Pentium III "katmai" instructions

5. "invd" in pentium

6. BEGIN{want[]={"s1o", "s2o", "s2q", "s3q"}

7. Optimized Pentium code on a Pentium Pro

8. Parsing ""D""?

9. "Fifth", "Forth", zai nar?

10. Ruby "finalize", "__del__"

11. beginners "let"/"random" question

12. ANNOUNCE: new "plus"- and "dash"-patches available for Tcl7.5a2/Tk4.1a2

 

 
Powered by phpBB® Forum Software