Performance improvement opportunities 
 Performance improvement opportunities

There has been a flurry of postings about performance improvement in the
past week or two, all related to compiling Python or speeding up the
interpreter.  I thought it would be worthwhile to summarize the various
options and their pros and cons as I see them, so that everyone is reading
from the same script.  This is just a first pass at some thoughts on the
subject.  Additions are welcome.

Interpreter improvements

    The goal is to improve the performance of the interpreter in various
    ways.  Opportunities include

    * improving name lookup
    * peephole optimization of the byte code compiler output
    * inlining critical interpreter functions


    Pros:

    * all applications benefit
    * no (or only minor) semantic changes

    Cons:

    * speedup is limited (perhaps a factor of 2 increase in performance)
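
As a small, hedged illustration of the peephole idea, here is a toy pass
over symbolic byte code.  The instruction names and the (opname, arg)
tuple encoding are made up for the example; they are not the compiler's
actual internal structures.

```python
def peephole(ops):
    """ops is a list of (opname, arg) tuples; return an optimized copy."""
    out = []
    i = 0
    while i < len(ops):
        # Constant folding: LOAD_CONST a; LOAD_CONST b; BINARY_ADD -> LOAD_CONST a+b
        if (i + 2 < len(ops)
                and ops[i][0] == 'LOAD_CONST'
                and ops[i + 1][0] == 'LOAD_CONST'
                and ops[i + 2][0] == 'BINARY_ADD'):
            out.append(('LOAD_CONST', ops[i][1] + ops[i + 1][1]))
            i += 3
        # Dead load: LOAD_CONST x; POP_TOP -> nothing at all
        elif (i + 1 < len(ops)
                and ops[i][0] == 'LOAD_CONST'
                and ops[i + 1][0] == 'POP_TOP'):
            i += 2
        else:
            out.append(ops[i])
            i += 1
    return out

code = [('LOAD_CONST', 2), ('LOAD_CONST', 3), ('BINARY_ADD', None),
        ('STORE_NAME', 'x'), ('LOAD_CONST', 0), ('POP_TOP', None)]
print(peephole(code))   # the 2+3 is folded and the dead load removed
```

A real pass would work on the compiler's own instruction stream, but the
shape is the same: slide a small window over the code and rewrite known
patterns.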

Compiling Python

    There are two general avenues for compiling Python code: "compile" the
    output of the byte code compiler or compile Python source code (or the
    abstract syntax tree generated by the parser module).  Compiling the
    byte codes to C or object code is really more of an assembler or macro
    processor type operation, hence the quotes.  I haven't really thought
    about any ramifications for just-in-time compilation in what follows.
    All of what I wrote is geared toward compilation when the application is
    not executing.  Also, I assume compilation will typically be to C code,
    to avoid obvious portability problems.

    Compiling the byte codes

        This can be thought of as creating a special-purpose version of
        eval_code2 (in file ceval.c) for each function compiled in this
        fashion.  You are really just unrolling the for loop in eval_code2
        so the C compiler can have a shot at optimizing things.


        Pros:

        * probably less effort to implement than a true compiler

        Cons:

        * possible semantic changes to the language (see below)
        * speedup is limited (you don't eliminate the interpreter's stack)
        * tougher to make high-level decisions (e.g., type inference) at
          this level
        * functions generated using this technique can get huge
        * probably will expose the Python interpreter internals outside of
          the interpreter itself
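
To see exactly what such a translator would be unrolling, the standard
`dis` module prints the byte codes that the interpreter loop dispatches
on.  (Opcode names vary between Python versions; this only shows the
general shape.)

```python
import dis

def add(a, b):
    return a + b

# One line per byte code instruction -- each line is one trip around the
# big dispatch loop in ceval.c that a byte-code compiler would unroll.
dis.dis(add)

instructions = list(dis.get_instructions(add))
print([i.opname for i in instructions])
```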

    Compiling Python

        This is true compilation as we normally think of it.  The output can
        be C or object code.  In theory, you can do all the cool things
        people do with compilers (type inference, dead code removal, loop
        unrolling, yada yada yada) to optimize things.


        Pros:

        * potentially the greatest speedup possible

        Cons:

        * hardest to implement well - probably need accurate type
          information to get order-of-magnitude speedups
        * probable semantic changes to the language (see below)
        * some constructs (e.g., class definition, exception handling) are
          tightly tied to the interpreter

Changing the language

In past discussions about compiling Python, the topic that has always
arisen is the idea of adding type declarations to the language or to doc
strings or something similar, so that the compiler can make more intelligent
choices without having to implement a sophisticated type inference engine.
One obvious win is to know that a loop index variable is an integer so that
a simple C for loop can replace a slower, more general-purpose loop that
relies on fetching and storing information from Python objects.
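
To make the loop-index example concrete: the function below is ordinary
Python, and the C loop in the comment is a hypothetical sketch of what a
type-aware compiler could emit for it.  No such compiler exists; the
translation is illustrative only.

```python
def sum_to(n):
    # Each iteration fetches, creates, and stores Python integer objects.
    total = 0
    for i in range(n):
        total = total + i
    return total

# Given a declaration that n, i, and total are machine integers, a
# compiler could emit a plain C loop with no Python objects at all
# (hypothetical output):
#
#     long total = 0;
#     for (long i = 0; i < n; i++)
#         total = total + i;

print(sum_to(10))   # 45
```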

The issue of user-visible changes to the language aside, I think a more
important consideration may be the implied semantic changes to the language
when you entirely (or almost entirely) eliminate the interpreter itself.
The interpreter handles exceptions at the granularity of a byte code
instruction.  Most of the time such things get handled quickly, ignoring the
occasional time spent in a long-running C function.  What happens when
almost all the execution occurs outside the control of the interpreter?
When is it safe to handle a KeyboardInterrupt or OverflowError?  When you
migrate integer computations from Python to C, will you give up the ability
to catch OverflowError exceptions altogether?  I'm sure these problems are
solvable, but they will require some work.

Skip Montanaro

Fri, 02 Jul 1999 03:00:00 GMT  
 Performance improvement opportunities


> There has been a flurry of postings about performance improvement in the
> past week or two all related to compiling Python or speeding up the
> interpreter.  I thought it would be worthwhile to summarize the various
> options and their pros and cons as I see them, so that everyone is reading
> from the same script.  This is just a first pass at some thoughts on the
> subject.  Additions are welcome.

I have an addition to make, and that is that those thinking about
compiling Python should think again -- compilation by itself is no magic
elixir for performance improvement.  As Python presently stands I would
not expect compilation to speed Python programs up very much at all.
The problem is that Python's semantics require that an implementation
does a lot of dynamic lookups (e.g. of object field names).  Compilation
will not remove such lookups.  Really efficient compilers remove such
lookups by doing static type analysis, but this of course requires a
language with type declarations, or a language in which the types of
variables can be inferred.  The only alternative is to use some of the
optimisations in the Self compiler -- but these are non-trivial
optimisations requiring a great deal of programming effort.  Moreover
it's not 100% clear that these optimisations are applicable to Python,
since in some respects Python is more dynamic than Self.

In short it is potentially possible to compile Python using Self-style
optimisations, but it is an enormous amount of work.  A simple-minded,
quick-and-dirty compiler will not improve code speed significantly.  My
opinion therefore is to spend one's time improving the interpreter, and
forget compilation.
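
The dynamic-lookup point can be seen directly with a small measurement.
The class and loop here are illustrative, timed with the standard
`timeit` module; absolute numbers will vary by machine.

```python
import timeit

class Point:
    def __init__(self):
        self.x = 1.0

p = Point()

def with_lookup():
    s = 0.0
    for _ in range(1000):
        s += p.x              # a dynamic attribute lookup every iteration
    return s

def with_local():
    x = p.x                   # one lookup, then a fast local variable
    s = 0.0
    for _ in range(1000):
        s += x
    return s

t1 = timeit.timeit(with_lookup, number=200)
t2 = timeit.timeit(with_local, number=200)
print('attribute lookup: %.4fs   local: %.4fs' % (t1, t2))
```

A naive compiler would have to emit the same lookup the interpreter
performs; only type knowledge (or Self-style specialization) makes it
go away.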


Sat, 03 Jul 1999 03:00:00 GMT  
 Performance improvement opportunities

For the fun of it, I decided to translate a Python function into C code.
I was careful to copy the code *directly* from ceval.c to see if it would
make a difference.  Here's Version 1 of the function:

    # Calculate a CRC (crctbl is global list of 256 constant integers)
    def crcblockorig(str, crc):
        for i in str:
            crc = ((crc << 8)) ^ crctbl[((crc >> 8) ^ ord(i)) & 0xFF]
        return crc

Yes, yes, of course this is the ideal candidate to redo directly in C,
but I was surprised how slow it was in Python.  On my Pentium 100, this
Python version will calculate a CRC over 16K of data per second.  For my
needs it was slow -- right on the border of "good enough".

I was able to speed it up using locals.  Here's Version 2:

    def crcblock(str, crc, crctbl):
        ord = __builtins__.ord
        for i in str:
            crc = ((crc << 8)) ^ crctbl[((crc >> 8) ^ ord(i)) & 0xFF]
        return crc

This was 40% faster, and was "good enough".
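
For readers who want to run the two versions, here is a self-contained
reconstruction.  The post never shows crctbl, so a standard CRC-16 table
for polynomial 0x1021 (XMODEM-style) is generated as a stand-in, and the
running value is masked to 16 bits so it stays bounded.  The timing
figures above came from the author's real table and machine, not this
sketch.

```python
def make_crctbl():
    # Standard CRC-16 table, polynomial 0x1021 (XMODEM-style),
    # standing in for the global crctbl the post refers to.
    tbl = []
    for byte in range(256):
        crc = byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
        tbl.append(crc)
    return tbl

crctbl = make_crctbl()

def crc_v1(s, crc):
    # Version 1: global table, builtin ord looked up every iteration.
    for i in s:
        crc = ((crc << 8) & 0xFFFF) ^ crctbl[((crc >> 8) ^ ord(i)) & 0xFF]
    return crc

def crc_v2(s, crc, crctbl):
    # Version 2: table passed in and ord bound to a local,
    # the modern equivalent of the __builtins__.ord binding above.
    ord_ = ord
    for i in s:
        crc = ((crc << 8) & 0xFFFF) ^ crctbl[((crc >> 8) ^ ord_(i)) & 0xFF]
    return crc

assert crc_v1('123456789', 0) == crc_v2('123456789', 0, crctbl)
```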

I thought this sort of code could be made faster with a compiler.  The
first step was to see if there was anything to get out of a
translation to C code from the byte-code.  So I hand-generated the
code by copying stuff from ceval.c into a C module.  Pushes, pops,
everything.  (Well, ok, so I trashed line numbers, and the frame for
the loop since I didn't understand it.)

This came out to a 268 line C function.  I had to include a bunch of
functions static to ceval.c.  That added another 146 lines.

The "compiled" code runs 48% faster than the optimized equivalent
python version.

I went on to see what a difference the Self-ish improvements would
make, and I can see that it would be a lot of work to implement such a
thing.  I noticed that a useful customized version of this function
could be generated by establishing the (constant) type of "ord" before
the main loop.  By doing that I could remove tuple-fication of the
function argument and inline the ord function.  The resulting speedup
was 203% over the original C code, or 3 times faster than the
hand-tuned Python code.  There's still a lot of other junk in the
result, including the verification of the type and length of the
string object handed to ord for each loop iteration.  Still, this is a
reasonable thing to be able to generate.

I know all these numbers are basically worthless, but it does show
that in one case, a very simple translator *could* translate Python
byte codes into potentially faster C code.  Not a lot faster, but a
little faster.  Also, there is a huge potential for speeding up the
code even further if you want to work at it.
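
The specialization described -- fixing what ord is before the loop so
the per-call checks move out of the hot path -- can even be imitated in
pure Python.  The 256-entry table below is a dummy, present only to make
the sketch runnable.

```python
# Dummy 256-entry table so the sketch is self-contained.
crctbl = [(i * 9) ^ 0x5A for i in range(256)]

def crcblock_plain(s, crc):
    for i in s:
        # `ord` is looked up by name and type-checked on every call
        crc = ((crc << 8) & 0xFFFF) ^ crctbl[((crc >> 8) ^ ord(i)) & 0xFF]
    return crc

def crcblock_spec(s, crc):
    # map() resolves `ord` once up front; the loop body sees plain ints
    for o in map(ord, s):
        crc = ((crc << 8) & 0xFFFF) ^ crctbl[((crc >> 8) ^ o) & 0xFF]
    return crc

assert crcblock_plain('hello, world', 0) == crcblock_spec('hello, world', 0)
```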

You can find the code at:
                  - a test program to run the different versions of CRC
                    (I used it to disassemble the Python bytecodes, too)
        fast.c    - the original C version
        fast2.c   - the above with ord function call optimization

Eric C. Newton        Software Reuser.
Newton Consulting     Look at all the code I didn't write today!

Thu, 08 Jul 1999 03:00:00 GMT  