SUMMARY: IEEE conversion (1 reply, plus corrections) 

Tom Chwastyk wrote on 27 Nov 1995:

Quote:
> (4) The slightly more complex "DfromSok" runs about 13.2K
>     conversions/second (13 times slower than Jim's) by using
>     zero setting and subscripting in place of expansion.

The execution time of an O(N) [i.e., linear] program can be modeled as:

     RunTime = SetupTime + (N {times} PerElementTime)

where N is the number of elements in the argument.  You can estimate the
parameters by making two timings, one with small N and one with large N.
If TS and TL are the execution times and NS and NL are the N values, the
parameters can be computed using:

     SetupTime PerElementTime {<-} (TS,TL) {domino} 1,[1.5] NS,NL

But these parameters are usually so small that it's more convenient to
discuss their reciprocals, which can be interpreted as calls-per-second
(overhead) and elements-per-second (flat-out processing speed).
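
For readers without an APL session handy, here is a minimal Python sketch of the same two-point fit. The function names (fit_linear_timing, per_second) are made up for illustration and are not part of any workspace mentioned in this thread; the closed-form solution below is just the 2-by-2 system above written out by hand.

```python
def fit_linear_timing(ns, ts, nl, tl):
    """Solve  ts = setup + ns*per_elt  and  tl = setup + nl*per_elt.

    ns, nl: small and large argument sizes (must differ)
    ts, tl: measured run times in seconds for those sizes
    Returns (setup_time, per_element_time), both in seconds.
    """
    if ns == nl:
        raise ValueError("need two different argument sizes")
    per_elt = (tl - ts) / (nl - ns)   # slope of the line through both points
    setup = ts - ns * per_elt         # intercept at N = 0
    return setup, per_elt


def per_second(setup, per_elt):
    """Reciprocals: (calls per second, elements per second)."""
    return 1.0 / setup, 1.0 / per_elt
```

With only two timings the fit is exact, so any noise in either measurement goes straight into the parameters; timing each size several times and taking the best is a reasonable precaution.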

   On my 486/66 using APL*PLUS III v1.2 and Windows 3.1, Tom's DfromSok
function takes about 0.0034 secs with N=1 and about 0.65 secs with
N=10,000.  My F32TO64 function (appended below) takes about 0.0015 secs
with N=1 and about 0.043 secs with N=100,000.  So the per-second
parameters for these two functions are:

                      Calls/Sec        Elts/Sec

         DfromSok        300             15,464

         F32TO64         667          2,409,614

In a particular application, the speed ratio between these two functions
may be anything from 2.2 (the calls/sec ratio) to 156 (the elts/sec
ratio), depending on the size of the arguments.  The speeds Tom reported
are consistent with an argument size of about 300 elements.
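
The figures above can be rechecked with a short Python sketch. It plugs the four timings quoted in this post into the linear model; the helper names (fit_linear_timing, run_time, speed_ratio, size_for_rate) are hypothetical, chosen only for this illustration.

```python
def fit_linear_timing(ns, ts, nl, tl):
    """Two-point fit of RunTime = SetupTime + N * PerElementTime."""
    per_elt = (tl - ts) / (nl - ns)
    return ts - ns * per_elt, per_elt   # (setup, per_elt) in seconds


# Timings quoted in the post (seconds at the given N).
dfromsok = fit_linear_timing(1, 0.0034, 10_000, 0.65)
f32to64 = fit_linear_timing(1, 0.0015, 100_000, 0.043)


def run_time(params, n):
    """Modeled run time in seconds for an n-element argument."""
    setup, per_elt = params
    return setup + n * per_elt


def speed_ratio(n):
    """How many times slower DfromSok is than F32TO64 at size n."""
    return run_time(dfromsok, n) / run_time(f32to64, n)


def size_for_rate(params, rate):
    """Argument size N at which N / RunTime(N) equals `rate` elts/sec.

    From N / (setup + N*per_elt) = rate, so
    N = rate*setup / (1 - rate*per_elt).
    """
    setup, per_elt = params
    return rate * setup / (1.0 - rate * per_elt)


if __name__ == "__main__":
    for name, (setup, per_elt) in (("DfromSok", dfromsok), ("F32TO64", f32to64)):
        print(f"{name:10s} {1 / setup:8.0f} calls/sec {1 / per_elt:12.0f} elts/sec")
    # Size at which DfromSok runs at the 13.2K conversions/sec Tom reported.
    print(round(size_for_rate(dfromsok, 13_200)))   # about 300
```

The ratio starts near the calls/sec ratio for tiny arguments and climbs toward the elts/sec ratio as N grows, which is why a single speed ratio says little without the argument size.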

                                                Jim

     {del} Z{<-}F32TO64 C;T



   +}8 bits



[7]    T{<-}0 858915563 {neg}1992758141 1714562148 35931273 610044262 {neg}{+
   +}957576422 {neg}972945595 {neg}972939451 {neg}973076923 1711283781 {+
   +}1711282360 1711293833 1712866697 3163591 28861952 15224832 1476395008 {+
   +}{neg}1132953554 74799184 82561229 1009014015 841247883 {neg}949558309 {+
   +}{neg}51643 1714847197 2134271361 1717990656 3556807 914217216 {+
   +}912621926 {neg}2090467265 1967076989 409832784 178278 59472 777519104 {+
   +}{neg}9324416 1481703423 {neg}926088075 1425999083 628243492 841247883 {+
   +}1867136 2105544565 292880669 606619019 {neg}351898365 {neg}352210148 {+
   +}{neg}352079086 {neg}351948018 {neg}351816950 {neg}351751418 856470274 {+
   +}{neg}339506496 71681624 88458755 541428481 {neg}1996298047 1166610501 {+
   +}122013192 7179521 243814 59472 777519104 {neg}16664448 1481703423 {neg}{+
   +}926088075 1425999083 {neg}1183695836 841247883 {neg}1960544885 {+
   +}1300958333 {neg}653466872 74878214 2139955165 871686664 {neg}{+
   +}2082284608 12794052 951127 120349
[8]    T[#IO]{<-}(1345730611 2000042035)['3/'{iota}#SYSID[#IO+12]]
[9]    {->}(T{<-}''{rho}(#STPTR'Z C')#CALL T){drop}0
[10]   #ERROR(3 5 7 8 12{iota}T){pick}'LENGTH ERROR' 'RANK ERROR' 'VALUE {+
   +}ERROR' 'WS FULL' 'MATH PROCESSOR ABSENT' 'DOMAIN ERROR'

     {del}

The loop in the assembler code is:

   L1:
    FLD DWORD PTR [ESI]     ; load C[i]
    LEA ESI,[ESI+4]         ; point to next element
    FSTP QWORD PTR [EDI]    ; store into Z[i]
    LEA EDI,[EDI+8]         ; point to next element
    LOOP L1                 ; decrement ECX and repeat until it is zero
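
For comparison, here is a rough Python sketch of what that loop does to the data: read each 4-byte IEEE single and write it back out as an 8-byte IEEE double. It uses only the standard struct module and assumes little-endian (Intel) byte order; the function name f32_to_f64_bytes is made up for this example and has nothing to do with the APL function's internals.

```python
import struct


def f32_to_f64_bytes(raw: bytes) -> bytes:
    """Widen packed little-endian IEEE singles to packed IEEE doubles."""
    if len(raw) % 4:
        raise ValueError("input length must be a multiple of 4 bytes")
    n = len(raw) // 4
    singles = struct.unpack(f"<{n}f", raw)   # like FLD DWORD PTR [ESI]
    return struct.pack(f"<{n}d", *singles)   # like FSTP QWORD PTR [EDI]


if __name__ == "__main__":
    packed = struct.pack("<3f", 1.0, -2.5, 0.1)
    widened = f32_to_f64_bytes(packed)
    print(len(packed), len(widened))          # 12 24
    print(struct.unpack("<3d", widened))
```

Every single-precision value is exactly representable as a double, so the widening loses nothing; 0.1 comes back as the nearest single to 0.1, not as the nearest double.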



Sat, 16 May 1998 03:00:00 GMT  

Jim writes:

Quote:
>    On my 486/66 using APL*PLUS III v1.2 and Windows 3.1, Tom's DfromSok
> function takes about 0.0034 secs with N=1 and about 0.65 secs with
> N=10,000.  My F32TO64 function (appended below) takes about 0.0015 secs
> with N=1 and about 0.043 secs with N=100,000.  So the per-second
> parameters for these two functions are:

>                       Calls/Sec        Elts/Sec

>          DfromSok        300             15,464

>          F32TO64         667          2,409,614

> In a particular application, the speed ratio between these two functions
> may be anything from 2.2 (the calls/sec ratio) to 156 (the elts/sec
> ratio), depending on the size of the arguments.  The speeds Tom reported
> are consistent with an argument size of about 300 elements.

>                                                 Jim

Jim and I have been discussing offline the reasons for my reporting a  
slanderously :-) low conversion rate for his F32TO64.  While the speed ratio  
of 13 that I got _might_ have resulted from an argument size of 300 elements,  
in fact it didn't.  In hopes of either preventing some of you from pulling  
similar boners, or giving others of you a chuckle over some other poor slob's  
mistakes, here's what contributed to my bum dope about speed:

1) I was using a roll-your-own execution time function whose idea goes back to  
I-beam function vintage (use #AI values for session time with {execute} of the  
expression-to-be-timed in between calls to #AI).  Back when #MF first came  
out, I looked at it and concluded it was great for performance monitoring in  
multi-line functions, but it seemed a little cumbersome to use for simple  
expression timing.  As Jim pointed out, the sensible thing is to use the EXT  
function from the MFFNS utility workspace.  This takes advantage of both the  
the MF facility and the partial compilation feature of APL III, which repeated  
executes do not.  This alone is responsible for almost a 2:1 ratio between my  
measured overhead and Jim's.  Moral: update utilities at least once a decade  
or as significant features come out.

2) The expression I was timing happened to be the one I used to check that my  
APL results agreed with Jim's ML results, with a +.= between the precomputed  
ML results and my APL results (this was where I discovered I'd forgotten the  
Intel lsB...msB order), as in 'ML+.=DfromSok x'.  I was using long enough x's  
that the +.= took 0.1 second.  "Not much" (8%!) in comparison to the 1.2  
seconds for DfromSok to execute, but pretty significant (1000%) in comparison  
to the 0.01 seconds it took F32TO64 to execute... Yep, I was checking that the  
ML result agreed with the ML result, and spending 90+% of the timing run doing  
it.  Moral: Cut and paste makes it easy to rerun statements for different  
cases --- but be sure the cases make sense.  In particular, for timing  
exercises, don't have a lot of extraneous junk code in the timing argument  
(one would think this was obvious, right?).

3) To get different size conversion arguments while suppressing screen output  
after cutting off the "ML+.=", I was timing expressions of the pattern '0  
0{rho}F32TO64 n{rho}x' for various values of n.  It turns out I was spending  
almost as much time in n{rho}x as in the conversion routine, despite the fact  
that n{rho}x is a pretty fast operation.  I got a factor of two increase in  
speed by precomputing nx{<-}n{rho}x and timing '0 0{rho}F32TO64 nx'.  Moral:  
...In particular, for timing exercises, don't have ANY extraneous junk code in  
the timing argument.

4) In view of Moral 3, I tried replacing the "0 0{rho}" construction with a  
permanent destination, as in 'k{<-}F32TO64 nx'.  Result: no statistically  
significant change in overhead, but a statistically significant 11% increase  
in speed.  Moral: in a timing argument, resizing to suppress output is  
extraneous junk code (see Moral 3).

Overall moral: restrict timing arguments to 'Z{<-}FOO X' with preallocated X  
and Z, and use a timing function that takes advantage of performance  
monitoring features.
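
The same discipline carries over to other languages. As a rough illustration only (this is Python's standard timeit module, not the EXT function or #MF), the sketch below builds the argument before the clock starts and times nothing but the call itself. The names time_call and fit_from_timings are hypothetical.

```python
import timeit


def time_call(func, arg, repeats=5, number=100):
    """Best-of-`repeats` seconds per call of func(arg).

    `arg` is built by the caller beforehand, so its construction is
    never part of the measurement. The small cost of the wrapping
    lambda is included in every call and lands in the setup term.
    """
    timer = timeit.Timer(lambda: func(arg))
    return min(timer.repeat(repeat=repeats, number=number)) / number


def fit_from_timings(func, make_arg, n_small, n_large):
    """Time func at two sizes and fit RunTime = setup + N * per_elt."""
    small = make_arg(n_small)        # preallocated, outside the timed region
    large = make_arg(n_large)
    ts = time_call(func, small)
    tl = time_call(func, large, number=10)
    per_elt = (tl - ts) / (n_large - n_small)
    return ts - n_small * per_elt, per_elt


if __name__ == "__main__":
    setup, per_elt = fit_from_timings(sum, lambda n: list(range(n)), 1, 200_000)
    print(f"setup ~ {setup:.2e} s, per element ~ {per_elt:.2e} s")
```

Timing something like sum(list(range(n))) directly would repeat mistake 3 above: building the argument would be counted as part of the function being measured.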

Somewhat sheepishly,

Tom



Sun, 17 May 1998 03:00:00 GMT  
 