CPS->C, pushy vs. dispatch-loop 
Author Message
 CPS->C, pushy vs. dispatch-loop

This data may be of interest to some implementors.

I have modified a compiler, which uses CPS internally and emits
reasonably portable ANSI-C code, to use the "pushy" CPS scheme
described by Henry G. Baker in Feb. '94 in some comp.* groups.

Summary:
* Tests done on a single-user SPARCstation 5 with Solaris 2.3
  and GCC 2.5.7.
* The standard CPS->C mapping (STANDARD) using global variables for
  parameter passing (and in my case a simulated stack for continuation
  records), and a dispatch-loop function for tailcalls works well.
* The portable "pushy" CPS->C code (PUSHY) is about 3 times _slower_
  than the STANDARD code. Although parameters are passed in registers
  instead of memory, and tailcalls go directly to their targets, the
  SPARC loses due to the enormous number of register window overflows.
  (The C compilers do not recognize that the register window is dead
  after the jump to g below:
        void f(...) { auto x; ... g(&x,...); return; }              )
* Using GCC's inline assembly and register allocation declarations to
  code up macros that do tailcalls without register window growth
  (PUSHY+GCC), we gain about a factor of 3-4.5 over PUSHY. PUSHY+GCC
  is typically 10-30% faster than STANDARD.

I have not yet had the opportunity to run the tests on any other
architectures than the SPARC, so the question "do we use STANDARD
or PUSHY in our CPS->C compilers" is still left unanswered :-(

Test data:
The application is a specification for the dynamic semantics of a
normal-order (call-by-name) functional language. The compiled spec.
is an interpreter for this language. The test case was hard-coded
(no parsing at runtime): a program computing prime numbers using
the well-known sieve method.
Times below are in seconds. GC-overhead is negligible.
Non-user time is in the 0.15-0.30 seconds range for all timings.

                        (# of primes to compute)
Version         10      20      30      40      50
-----------------------------------------------------
STANDARD        0.22    0.68    1.62    3.90    8.60
PUSHY           0.48    1.85    5.30    13.20   25.41
PUSHY+GCC       0.31    0.61    1.35    3.03    5.89
--
Mikael Pettersson, Dept of Comp & Info Sci, Linkoping University, Sweden



Sun, 23 Feb 1997 21:29:23 GMT  
 CPS->C, pushy vs. dispatch-loop
Quote:

>This data may be of interest to some implementors.

>I have modified a compiler, which uses CPS internally and emits
>reasonably portable ANSI-C code, to use the "pushy" CPS scheme
>described by Henry G. Baker in Feb. '94 in some comp.* groups.

>Summary:
>* Tests done on a single-user SPARCstation 5 with Solaris 2.3
>  and GCC 2.5.7.
>* The standard CPS->C mapping (STANDARD) using global variables for
>  parameter passing (and in my case a simulated stack for continuation
>  records), and a dispatch-loop function for tailcalls works well.
>* The portable "pushy" CPS->C code (PUSHY) is about 3 times _slower_
>  than the STANDARD code. Although parameters are passed in registers
>  instead of memory, and tailcalls go directly to their targets, the

                                                                  ^^^
Quote:
>  SPARC loses due to the enormous number of register window overflows.

   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Quote:
>  (The C compilers do not recognize that the register window is dead
>  after the jump to g below:
>    void f(...) { auto x; ... g(&x,...); return; }              )
>* Using GCC's inline assembly and register allocation declarations to
>  code up macros that do tailcalls without register window growth
>  (PUSHY+GCC), we gain about a factor of 3-4.5 over PUSHY. PUSHY+GCC
>  is typically 10-30% faster than STANDARD.

>I have not yet had the opportunity to run the tests on any other
>architectures than the SPARC, so the question "do we use STANDARD
>or PUSHY in our CPS->C compilers" is still left unanswered :-(

I see that 'Cheney on the MTA' is too obscure.  Oh well, 'pushy' is
short and sweet.  :-)

Readers should be aware that the 3X slowdown is _unique_ to the SPARC
'register window' architecture.  The tests that I and a number of others
ran last spring on other architectures -- e.g., Alpha, MIPS, 68K, Intel,
etc., all showed competitive performance _without_ special function-call
hacking.  (The size of the stack buffer does have to be tuned to the
architecture, however, to achieve the best performance.)

The performance measurements found here are similar to the results from
my hand-compiled Boyer benchmark, so I am gratified that my hand compilation
didn't use optimizations not available to a normal compiler.

Thanks very much for your results.

Would it be possible to post the C code emitted for some simple functions?
(e.g., factorial, fibonacci, etc.)



Tue, 25 Feb 1997 01:07:58 GMT  
 
 [ 2 post ] 

 Relevant Pages 

1. CPS -> goto/while

2. Q: Endless loop by dispatching

3. Tcl Event Loop vs TK Event Loop issues with Asynchronous Events

4. Tail Recursive loops <-> Failure Driven loops

5. Emulate multiple dispatch in a single dispatch language

6. Emulate multiple dispatch in a single dispatch language

7. Emulate multiple dispatch in a single dispatch language

8. Test and Set (TS) vs Compare and Swap (CS)

9. to CS: or not to CS: in F-PC assembler

10. the Web IS a _computer_, HTML its language (and a pushy Scheme CGI setting it off)

11. CORRECTING -->NOW SHOWING - wiki.cs.uiuc.edu/VisualWorks/Recent+Changes

12. Leonardo=>MaxPlus/Quartus Vs Synopsys=>MaxPlus/Quartus

 

 
Powered by phpBB® Forum Software