APL vs. Assembler benchmark

Here's another APL vs. a-faster-language benchmark, using a program that
I recently translated to assembler for a client.  The small (10 line)
APL program is inherently iterative--the output from one iteration is
the input to the next iteration.  Within the loop, all the operations
apply to vectors (of shorter and shorter length with each iteration).
The operations in the loop are: indexing (4), addition (2),
multiplication (2), division (2), drop (2), and take (1), plus the usual
loop control operations.  The arithmetic is floating point, and the
assembler code uses '387 floating point operations (as does the
interpreter).
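
   To give a feel for the shape of the computation (the client's actual
code isn't shown here), a hypothetical sketch of such an iterative,
shrinking-vector loop might look like this, with stand-in arithmetic:

        ∇ Z←STEPS V
       ⍝ Hypothetical sketch only: the name, labels, and arithmetic are
       ⍝ stand-ins, not the client program.
        Z←V
       LOOP:→(0=⍴Z)/0        ⍝ quit when the vector is exhausted
        Z←1↓Z×1+Z            ⍝ whole-vector arithmetic; one element shorter each pass
        →LOOP
        ∇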

   To produce the assembler routine, an "assembler prototype" was first
written in APL; this is a version of the program in which all function
arguments are scalars.  The prototype was timed along with the original
program and the assembler program to give an idea of what you can expect
if you completely ignore APL's array handling capabilities.  The
following table gives the execution times for the programs on my Toshiba
3100SX, 16 MHz 80386SX computer (#WA=3.9e6) using APL*PLUS II/386
version 3.3:

                  ASM
       N       Prototype            APL          ASM
      --    --------------     ------------     -----
       2      .0974 ( 42x)      .0438 (19x)     .0023
       5      .4066 ( 99x)      .0931 (23x)     .0041
      10     1.4178 (151x)      .2006 (21x)     .0094
      20     5.2329 (177x)      .4956 (17x)     .0296
      40    20.2205 (190x)     1.4145 (13x)     .1063
      80    79.2102 (193x)     4.5667 (11x)     .4105

  N:  The initial size of the arrays handled in the loop, and the number
      of iterations of the loop.

ASM Prototype:  Execution time for the scalar APL program, in seconds.
      The "NNx" figure is the prototype execution time divided by the
      assembler routine's execution time.

APL:  Execution time for the original APL program, in seconds.  Its
      "NNx" figure is likewise the ratio to the assembler routine's time.

ASM:  Execution time for the assembler code.

   Both the APL and the assembler programs are tightly written--there are
no unnecessary operations, but I didn't spend a lot of time trying to
squeeze a little more speed out of either version.  In short, both are
good production-quality code.

   There is one significant difference between the APL and assembler
code:  In the setup code before the loop, the APL program performs some
calculations on square matrices, but only the lower-triangular portion
(and diagonal) is really needed.  The assembler program avoids doing the
arithmetic on the upper-triangular portion.  On the one hand, this makes
this benchmark of questionable value since the APL program is doing more
work than the assembler program.  But on the other hand, this is how
triangular procedures are often coded in APL (even though there are
better, but more complicated, methods), so maybe the benchmark isn't so
meaningless.  In any case, the wasted time in the APL program can be
measured and subtracted for a more balanced comparison.  In the APL
version, the setup code accounts for 40% of the total time for N=10, 48%
for N=20, 58% for N=40, and 68% for N=80.  If the APL code didn't
perform these wasted arithmetic operations, the execution time for N=80
would be about


        W←(80×81÷2)÷80*2     ⍝ fraction of the square-matrix arithmetic actually needed

        (4.5667×1-.68) + W×4.5667×.68

or about 3.0 seconds, which is roughly 7 times slower than the assembler
code, compared with 11 times slower when the wasted arithmetic is included.
(The assembler prototype, by the way, avoids the wasted arithmetic.)
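
   For readers unfamiliar with the idiom, the usual (wasteful) APL style
described above looks something like the following hypothetical fragment,
where the whole square result is computed and the upper triangle is simply
thrown away:

        N←4
        A←?N N⍴100          ⍝ stand-in square matrix
        L←(⍳N)∘.≥⍳N         ⍝ 1s on and below the diagonal
        L×A+A               ⍝ arithmetic done on every element; upper triangle then discarded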

   As seen before in my April 23 benchmark posting (the singular value
decomposition example), the ratio between the APL and the compiled
program's execution times improves as the array size increases.  This is
because the fixed overhead of interpretation and memory management
becomes a smaller fraction of the total time as more time is spent
processing longer vectors.  The anomalous ratio for N=2 is, I believe,
because the fixed overhead of calling the assembler routine (via #CALL)
and the extensive checking the routine performs on its inputs have become
a significant portion of the total time.
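
   A crude way to see this is to model each timing as a fixed overhead
plus a per-element cost; the constants below are purely illustrative, not
measurements:

        N←2 5 10 20 40 80
        FI←0.002               ⍝ hypothetical fixed interpreter overhead
        VI←0.00005             ⍝ hypothetical per-element cost, interpreted
        FC←0.0001              ⍝ hypothetical fixed overhead, compiled
        VC←0.00001             ⍝ hypothetical per-element cost, compiled
        (FI+VI×N)÷FC+VC×N      ⍝ the ratio falls toward VI÷VC as N grows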

   The ratios for the assembler prototype vividly demonstrate why
looping has such a bad reputation in APL.  Of course, what's really bad
about this version is that it processes scalars throughout.  If you
unrolled all the loops for a specific case, making it a loopless scalar
algorithm, it would still be miserably slow.
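
   For completeness, the prototype's style corresponds to something like
the following hypothetical fragment: the same stand-in arithmetic as the
sketch near the top of this post, but done one scalar at a time:

        ∇ Z←SCALARSTEPS V;I;N
       ⍝ Hypothetical sketch of the all-scalar prototype style.
        N←⍴V
       OUTER:→(0=N)/DONE
        I←1
       INNER:→(I≥N)/SHRINK
        V[I]←V[I+1]×1+V[I+1]   ⍝ scalar indexing and scalar arithmetic only
        I←I+1
        →INNER
       SHRINK:N←N-1
        →OUTER
       DONE:Z←N↑V
        ∇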

                                                Jim


