APL vs. Assembler benchmark

Here's another APL vs. a-faster-language benchmark, using a program
that I recently translated to assembler for a client. The small
(10-line) APL program is inherently iterative--the output from one
iteration is the input to the next iteration. Within the loop, all
the operations apply to vectors (of shorter and shorter length with
each iteration). The operations in the loop are: indexing (4),
addition (2), multiplication (2), division (2), drop (2), and take
(1), plus the usual loop control operations. The arithmetic is
floating point, and the assembler code uses '387 floating point
operations (as does the interpreter).
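
Just to give a feel for the shape of the computation, here is a rough
sketch of the driver loop, written in the {keyword} transliteration
used later in this post. This is not the client's program; STEP is a
made-up stand-in for the actual indexing/arithmetic/drop/take work,
and it is assumed to return a shorter vector each time:

          {del} R{<-}N DRIVER V;I
    [1]   I{<-}0
    [2]   LOOP:{->}(I{>=}N)/DONE
    [3]   V{<-}STEP V     {lamp} output of one pass is the input to the next
    [4]   I{<-}I+1
    [5]   {->}LOOP
    [6]   DONE:R{<-}V
          {del}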

To produce the assembler routine, an "assembler prototype" was first
written in APL; this is a version of the program in which all function
arguments are scalars. The prototype was timed along with the original
program and the assembler program to give an idea of what you can
expect if you completely ignore APL's array handling capabilities.
The following table gives the execution times for the programs on my
Toshiba 3100SX, 16 MHz 80386SX computer (#WA=3.9e6) using APL*PLUS
II/386 version 3.3:

          ASM
 N     Prototype          APL         ASM
--   --------------   ------------   -----
 2     .0974 ( 42x)    .0438 (19x)   .0023
 5     .4066 ( 99x)    .0931 (23x)   .0041
10    1.4178 (151x)    .2006 (21x)   .0094
20    5.2329 (177x)    .4956 (17x)   .0296
40   20.2205 (190x)   1.4145 (13x)   .1063
80   79.2102 (193x)   4.5667 (11x)   .4105

N:   The initial size of the arrays handled in the loop, and the
     number of iterations of the loop.

ASM Prototype:  Execution time for the scalar APL program, in
     seconds. The "NNx" figure is the prototype execution time
     divided by the assembler routine's execution time.

APL: Execution time for the original APL program, in seconds; its
     "NNx" figure is likewise the ratio to the assembler time.

ASM: Execution time for the assembler code, in seconds.

Both the APL and the assembler programs are tightly written, with no
unnecessary operations, but I didn't spend a lot of time trying to
see whether a little more time could be squeezed out of either
version. In short, both are good production-quality code.

There is one significant difference between the APL and assembler
code: in the setup code before the loop, the APL program performs
some calculations on square matrices, but only the lower-triangular
portion (and the diagonal) is really needed. The assembler program
avoids doing the arithmetic on the upper-triangular portion. On the
one hand, this makes the benchmark of questionable value, since the
APL program is doing more work than the assembler program. On the
other hand, this is how triangular procedures are often coded in APL
(even though there are better, but more complicated, methods), so
maybe the benchmark isn't so meaningless. In any case, the wasted
time in the APL program can be measured and subtracted for a more
balanced comparison. In the APL version, the setup code accounts for
40% of the total time for N=10, 48% for N=20, 58% for N=40, and 68%
for N=80. If the APL code didn't perform the wasted arithmetic, the
execution time for N=80 would be about

      W{<-}(80{times}81{divide}2){divide}80*2
      (4.5667{times}1-.68) + W{times}4.5667{times}.68

or 3.2 seconds (W here is just the fraction of an 80-by-80 matrix
that lies on or below the diagonal), which is about 8 times slower
than the assembler code, compared with 11 times slower when the
wasted arithmetic is included.

(The assembler prototype, by the way, avoids the wasted arithmetic.)
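
For what it's worth, the "often coded" triangular style mentioned
above usually looks something like the following sketch (again in the
{keyword} transliteration; SETUPCALC is a hypothetical stand-in for
the actual setup arithmetic, which I haven't shown):

      LT{<-}({iota}N){jot}.{>=}{iota}N   {lamp} 1s on and below the diagonal
      R{<-}LT{times}SETUPCALC M          {lamp} full-matrix arithmetic; the upper triangle is discarded

The "better, but more complicated" methods alluded to above would
operate only on the lower-triangular elements themselves (e.g., a
raveled triangle), avoiding the wasted arithmetic at the cost of
messier indexing.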

As seen before in my April 23 benchmark posting (the singular value
decomposition example), the execution-time ratio between the APL and
compiled programs improves as the array size increases. This is
because the fixed overhead of interpretation and memory management
becomes a smaller fraction of the total time as more time is spent
processing longer vectors. The anomalous ratio for N=2 is, I believe,
because the fixed overhead of calling the assembler routine (via
#CALL) and the extensive checking that the assembler routine performs
on its inputs become a significant portion of the total time.

The ratios for the assembler prototype vividly demonstrate why
looping has such a bad reputation in APL. Of course, what's really
bad about this version is that it processes scalars throughout. If
you unrolled all the loops for a specific case, making it a loopless
scalar algorithm, it would still be miserably slow.
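
The effect is easy to reproduce with a deliberately silly example
(mine, not from the benchmark): summing a vector one element at a
time makes the interpreter execute several statements per element,
while the equivalent array expression pays the interpretation
overhead only once:

          {del} R{<-}SCALARSUM V;I
    [1]   R{<-}0
    [2]   I{<-}0
    [3]   LOOP:{->}(I{>=}{rho}V)/0
    [4]   I{<-}I+1
    [5]   R{<-}R+V[I]    {lamp} several interpreted statements per element
    [6]   {->}LOOP
          {del}

          +/V            {lamp} one interpreted primitive for the whole vector

The assembler prototype is in essentially the SCALARSUM position for
every operation in the program, which is why its ratios climb toward
200x.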

Jim