Puzzler: Binomial Coefficients
Author Message
Puzzler: Binomial Coefficients

Appended below is a fast utility for finding prime factors on APL*PLUS
II/386 and +III/Windows.  It mimics monadic q: in J.  Some examples:

FACTOR 12
2 2 3

FACTOR 1 33 42 100035
0  0 0 0 0  0  0
3 11 0 0 0  0  0
2  3 7 0 0  0  0
3  3 3 3 5 13 19

{times}/1{max}FACTOR 1 33 42 100035
1 33 42 100035

FACTOR 0

The speed of the factoring code is respectable (for example, on my
486/66 running +II/386 v5.2, factoring 130000+{iota}5000 takes 0.212
secs, compared with 4.12 secs for J2.06), but the PRIMESTO function
(another FastFn, included in one of my March 14 postings) can't hold a
candle to J's prime finder.  For example, FACTOR takes 0.18 secs to
factor (2*31)-1 (the largest 32-bit integer, itself a prime), whereas J
does it in 0.034 secs.  Almost all the time is spent in PRIMESTO, which
takes 0.174 secs to find primes up to 46,340.  In contrast, finding
these primes with p: in J takes 0.0242 secs, seven times faster.  I
should have used a sieve instead of Algorithm P in PRIMESTO.

Jim

{del} Z{<-}P FACTOR A;S;#IO

+}{max}/A)*.5

[6]    #IO{<-}1
[7]   L1:{->}(''{rho}S{<-}(#STPTR'Z A P')#CALL FACTOR{delta}OBJ)/L2

+}the end

[10]   {->}0

[14]  L3:#ERROR(5 7 8{iota}1{take}S){pick}'RANK ERROR' 'VALUE ERROR' 'WS {+
+}FULL' 'DOMAIN ERROR'
{del}

{del}.    FACTOR{delta}OBJ{<-}1345730611 858915563 {neg}1990005629 171613501{+
+}2 35931273 610044262 1284072986 {neg}326421980 34031046 35603910 37176774 {+
+}411078 1983942 3556806 1620070 4557158 407210342 809863526 1212532582 {neg{+
+}}1201274880 {neg}397410303 0 2021666392 1968722091 {neg}339161852 60955008{+
+}4 409832806 178278 59472 777519104 1351448704 {neg}855345832 {neg}16454712{+
+} 1919296596 612139129 762577436 472138950 112748034 15224832 1476395008 16{+
+}89813038 1358954495 {neg}855345832 {neg}16454712 1936073812 87909896 {neg}{+
+}347376640 611093305 812485962 178278 59472 777519104 {neg}13256576 1481703{+
+}423 {neg}926088075 1425999083 510813732 1243901067 36994432 2105542773 175{+
+}440181 28515819 78844139 95620331 129173739 145949931 {neg}1070399253 {neg{+
+}}202783862 {neg}973078528 {neg}972946363 {neg}972946107 {neg}1962801339 11{+
+}66614597 340117264 32 {neg}697244681 1962998403 138774993 1711304077 13421{+
+}78232 232 {neg}2144446464 {neg}82248 1968722175 {neg}339161852 609550084 {{+
+}neg}1951698330 {neg}1958075284 1300958333 {neg}54512888 1166781427 4828392{+
+}4 {neg}1957804663 2106133629 541428562 50520257 1166615621 944081750 50520{+
+}257 1166621765 {neg}1983892646 1569414725 860416804 {neg}137851950 1641759{+
+}0 104534132 {neg}964486018 1002539780 {neg}411936139 378212587 {neg}947710{+
+}071 81156 125042688 260691435 {neg}1962621053 1383959495 1161545515 {neg}1{+
+}996261794 2097372741 79921998 {neg}1957528183 1564163189 {neg}1961659562 8{+
+}1155 {neg}1945174016 {neg}258 {neg}930360972 1569436139 49004894 {neg}9979{+
+}98549 50018 960317 105017

{del}:  FACTOR{delta}OBJ[#IO]{<-}(1345730611 2000042035)['3/'{iota}{+
+}#SYSID[#IO+12]]

Below is an APL model of the assembler function.  For the
130000+{iota}5000 test case, it runs about 660 times slower than the
assembler routine.  (And this understates the APL:asm ratio because 35%
of the execution time for the FastFn is spent doing the two APL lines
[8] and [9].)  It's a pity that on PCs we can't compile scalar-iterative
programs like this to produce something that runs nearly as fast as the
hand-written assembler code.  (RBE: Tell me I'm wrong!)

{del} Z{<-}P FACTOR{delta}APL A;C;I;J;K;N;R;S;KMAX

[2]    {->}(0{/=}#NC'P')/L2
[3]    P{<-}PRIMESTO {floor}(0{max}{max}/,A)*.5
[4]   L2:S{<-}{rho}A
[5]    A{<-},A

[7]    KMAX{<-}0
[8]    I{<-}K{<-}1

[10]

+}machine instruction!)

[15]   J{<-}J+1

[18]   {->}L8
[19]

[21]  L10:Z[I;K]{<-}P[J]
[22]   K{<-}K+1

[25]

[29]  L8:Z[I;K]{<-}C
[30]   K{<-}K+1
[31]

[34]   K{<-}1
[35]   I{<-}I+1
[36]  L1:J{<-}1
[37]   {->}(I>{rho}A)/L4

[39]   #ERROR(N<1)/'DOMAIN ERROR'

[42]   {->}L6
[43]  L4:
[44]

{del}

Sun, 06 Sep 1998 03:00:00 GMT
Puzzler: Binomial Coefficients

Jim Weigang writes on Wednesday, March 20:

Quote:
> Appended below is a fast utility for finding prime factors on APL*PLUS
> II/386 and +III/Windows.  It mimics monadic q: in J.  Some examples:
>  ...
>    The speed of the factoring code is respectable (for example, on my
> 486/66 running +II/386 v5.2, factoring 130000+{iota}5000 takes 0.212
> secs, compared with 4.12 secs for J2.06), but the PRIMESTO function
> (another FastFn, included in one of my March 14 postings) can't hold a
> candle to J's prime finder.  For example, FACTOR takes 0.18 secs to
> factor (2*31)-1 (the largest 32-bit integer, itself a prime), whereas J
> does it in 0.034 secs.  Almost all the time is spent in PRIMESTO, which
> takes 0.174 secs to find primes up to 46,340.  In contrast, finding
> these primes with p: in J takes 0.0242 secs, seven times faster.  I
> should have used a sieve instead of Algorithm P in PRIMESTO.

In J2.06 (and J3.01), q: worked on its argument one scalar at a time,
and q: v=.130000+i.5000 computed essentially the same list of primes
5000 times.  Providing integrated rank support in q: improved the
situation somewhat: On a Pentium P90 running Windows NT, q: v
takes 1.7425 seconds in J3.01, versus 0.1472 seconds in J3.02x.
The factor of 11.8 improvement is short of the factor of 19.4
superiority of the assembler version over the J2.06 version;
I attribute it to the ability to compute the quotient and remainder
in one machine instruction in the assembler version.

p.s. In line [6] of the APL function "FACTOR", the result matrix Z
can be initialized to 30 columns instead of 32 columns.  (A signed
4-byte integer can not have more than 30 prime factors.)

Mon, 07 Sep 1998 03:00:00 GMT
Puzzler: Binomial Coefficients
Jim Weigang writes on Wednesday, March 20:

Quote:
> Appended below is a fast utility for finding prime factors on APL*PLUS
> II/386 and +III/Windows.  It mimics monadic q: in J.  Some examples: ...
>    The speed of the factoring code is respectable (for example, on my
> 486/66 running +II/386 v5.2, factoring 130000+{iota}5000 takes 0.212
> secs, compared with 4.12 secs for J2.06),  ...

Roger Hui writes on Thursday, March 21:

Quote:
> In J2.06 (and J3.01), q: worked on its argument one scalar at a time,
> and q: v=.130000+i.5000 computed essentially the same list of primes
> 5000 times.  Providing integrated rank support in q: improved the
> situation somewhat: On a Pentium P90 running Windows NT, q: v
> takes 1.7425 seconds in J3.01, versus 0.1472 seconds in J3.02x.
> The factor of 11.8 improvement is short of the factor of 19.4
> superiority of the assembler version over the J2.06 version;
> I attribute it to the ability to compute the quotient and remainder
> in one machine instruction in the assembler version.

With a few other improvements to q:, the time for q: v is now
down to  0.1152, a factor of 15.1 improvement over J3.01, but
still short of the factor of 19.4 achieved by the hand-coded
assembler version.

Jim Weigang:

Quote:
>    Below is an APL model of the assembler function.  For the
> 130000+{iota}5000 test case, it runs about 660 times slower than the
> assembler routine.  (And this understates the APL:asm ratio because 35%
> of the execution time for the FastFn is spent doing the two APL lines
> [8] and [9].)  It's a pity that on PCs we can't compile scalar-iterative
> programs like this to produce something that runs nearly as fast as the
> hand-written assembler code.  (RBE: Tell me I'm wrong!) ...

I was curious about the analogous ratio between the primitive q:
and a J model of it (i.e. the ratio between hand-coded C and J),
so I wrote the following model:

pa    =: [ #~ 0: = |

qa    =: 4 : 0
z=.p=.x. pa r=.y.
while. #p do. z=.z,p=.p pa r=.<.r%*/p end.
/:~z,r-.1
)

qco   =: primes qa"1 0 ]

"qco" is the main function and is a model of q: .

The dyad "qa" applies to a scalar right argument of the
integer to be factored, and a vector left argument of
the prime divisor candidates.  Within the function,
r is the remainder to be factored and p are the remaining
prime candidates.

"primes" provides a list of all the primes required by "qa".
The subfunction "pn" provides an estimate (upper bound) of
the number of primes less than the argument, based on the
prime number theorem.

The following examples illustrate the model in action:

qco 1 10 100 1000
0 0 0 0 0 0
2 5 0 0 0 0
2 2 5 5 0 0
2 2 2 5 5 5

y=.314159265  NB. number to be factored
p=.primes y
10{.p
2 3 5 7 11 13 17 19 23 29
_10{.p
20857 20873 20879 20887 20897 20899 20903 20921 20929 20939
\$p
2355
pn %: y
2355

p0=.p
r0=.y
[ p1=.p0 pa r0
3 5 7 127 7853
[ r1=.r0 % */ p1
3
[ p2=.p1 pa r1
3
[ r2=.r1 % */ p2
1
[ p3=.p2 pa r2

qco y
3 3 5 7 127 7853

For v=.130000+i.5000, qco v takes 10.245 seconds, compared to
0.1152 seconds for q: v; in this case the hand-coded C is
faster by a factor of 89.

Mon, 07 Sep 1998 03:00:00 GMT
Puzzler: Binomial Coefficients

Quote:

> With a few other improvements to q:, the time for q: v is now
> down to  0.1152, a factor of 15.1 improvement over J3.01, but
> still short of the factor of 19.4 achieved by the hand-coded
> assembler version.

Impressive!  The remaining difference from assembler is practically
insignificant compared to the improvement so far.  In achieving a
15.1x speedup you've gone from 1.743 secs down to 0.115 secs; achieving
a 19.4 ratio would involve saving only 0.025 additional secs, a mere
1.5% of the original figure.  (As Clark Wiedmann used to remind me,
speed ratios can be a deceptive measure of time savings.)

Quote:

>    Below is an APL model of the assembler function.  For the
> 130000+{iota}5000 test case, it runs about 660 times slower than the
> assembler routine.

> I was curious about the analogous ratio between the primitive q:
> and a J model of it (i.e. the ratio between hand-coded C and J), ...
> For v=.130000+i.5000, qco v takes 10.245 seconds, compared to
> 0.1152 seconds for q: v; in this case the hand-coded C is
> faster by a factor of 89.

My FACTOR{delta}APL function was a direct translation of the assembler
program to APL using only scalar operations, and it was provided mostly
to document the algorithm used by FACTOR.  Roger's qco program uses
vector operations, speeding it up greatly over pure scalar code, so it's
not really comparable to FACTOR{delta}APL.  Here's an APL version of
qco:

{del}Z{<-}QCO N;P

[2]   P{<-}PRIMESTO {floor}(0{max}{max}/,N)*.5
[3]   Z{<-}{disclose}({enclose}P)QA{each}N
{del}

{del} Z{<-}P QA N

[2]    Z{<-}P{<-}(0=P|N)/P
[3]   L1:{->}(0={rho}P)/L2
[4]    N{<-}{floor}N{divide}{times}/P
[5]    Z{<-}Z,P{<-}(0=P|N)/P
[6]    {->}L1
[7]   L2:Z{<-}Z,N~1
{del}

(Note:  QCO uses APL2-style or APL*PLUS Evlevel 2 nested array
operations.  The {disclose} on line [3] is a mix-type disclose, not
first.)

For the test case of 130000+{iota}5000, QCO runs in 10.78 secs
(486/66, +II/386 v5.2), which I estimate to be about 40 times slower
than Roger's latest q: .  (But perhaps Roger could time QCO on +III
using the same computer as for the J3.02x timings.)  It is this factor
of 40, not the 660, which should be compared with J's factor of 89.

Jim

Wed, 09 Sep 1998 03:00:00 GMT
Puzzler: Binomial Coefficients

Jim Weigang writes on Saturday, March 23:

Quote:
> Impressive!  The remaining difference from assembler is practically
> insignificant compared to the improvement so far.  In achieving a
> 15.1x speedup you've gone from 1.743 secs down to 0.115 secs; achieving
> a 19.4 ratio would involve saving only 0.025 additional secs, a mere
> 1.5% of the original figure.  (As Clark Wiedmann used to remind me,
> speed ratios can be a deceptive measure of time savings.)  ...

How far it has come matters less than how far it has to go.
But am I not complaining in this case:  hand coding in C,
coming within 30% of hand-coded assembler is probably the best
that can be expected, esp. when the one-instruction quotient/
remainder facility plays such an important role.

Quote:
> My FACTOR{delta}APL function was a direct translation of the assembler
> program to APL using only scalar operations, and it was provided mostly
> to document the algorithm used by FACTOR.  Roger's qco program uses
> vector operations, speeding it up greatly over pure scalar code, so it's
> not really comparable to FACTOR{delta}APL.  Here's an APL version of
> qco:  ...

Yes, I was aware of that difference.  My interest in the model
is not as a direct translation of the C code.  Generally,
I find such translations not as useful as a model written in the
best J style.  This is the case whether the model is written after
the fact (as for q: here) or especially beforehand, when I am
groping for an understanding of the problem (see the model on the
gamma function posted a few weeks ago).  Having written the model
"in the best J style", the hand translation into C is relatively
straightforward.

The timing on qco was to answer the question, regarding execution
speed, how close can the best J that I can write come to the best C
that I can write?  (Answer: not yet as close as one would like.)

Quote:
>    For the test case of 130000+{iota}5000, QCO runs in 10.78 secs
> (486/66, +II/386 v5.2), which I estimate to be about 40 times slower
> than Roger's latest q: .  (But perhaps Roger could time QCO on +III
> using the same computer as for the J3.02x timings.)  It is this factor
> of 40, not the 660, which should be compared with J's factor of 89.

I entered the QCO function into APL*PLUS III v1.2 running on NT on
a P90, and the result on the argument 130000+{iota} 5000 (in 0-origin)
is 4.636 seconds, so the estimate of a factor of 40 slower that q:
is very accurate.

q:   J3.01       1.7425
q:   J3.02x      0.1152
qco  J3.02x     10.245
QCO  APL*PLUS    4.636

In running QCO, I had to use just a stub for the function primesto,
as I do not have the assember version of that.

{del} z {=.} primesto n
z {=.} (n {>:} PRIMES)/PRIMES
{del}

On the argument in question, this takes 0.0003 seconds, which I
gather is same insignificant magnitude that the assembler function
would have taken, relative to the time for QCO itself.

Thu, 10 Sep 1998 03:00:00 GMT

 Page 1 of 1 [ 5 post ]

Relevant Pages