fp benchmark 

Available is a benchmark program based on an ordinary
differential equation solver written originally for
kForth. There are versions for pfe and gforth as well.
The kForth version assumes an integrated fp/data stack,
while the pfe and gforth versions assume a separate
fp stack. Following are the results on my 330 MHz PII:

kforth          4.80 s
pfe             3.09 s
gforth          2.04 s
gforth-fast     1.82 s

The performance hit for an integrated fp/data stack is
not particularly bad on an x86 type processor. I have noted
that gforth (not the fast version) is typically 3 to 4 times
faster than kForth for non-fp code.

Links to the source files are:

kforth  ftp://ccreweb.org/software/kforth/benchmarks/slbench-kforth.4th
pfe     ftp://ccreweb.org/software/pfe/benchmarks/slbench-pfe-floating.4th
gforth  ftp://ccreweb.org/software/gforth/benchmarks/slbench-gforth.4th

Cheers,
Krishna



Mon, 04 Apr 2005 21:16:02 GMT  

Quote:
> Available is a benchmark program based on an ordinary
> differential equation solver written originally for
> kForth. There are versions for pfe and gforth as well.
> The kForth version assumes an integrated fp/data stack,
> while the pfe and gforth versions assume a separate
> fp stack. Following are the results on my 330 MHz PII:
> kforth          4.80 s
> pfe             3.09 s
> gforth          2.04 s
> gforth-fast     1.82 s

[..]

On my old P54c 166 MHz machine your benchmark (the gForth
variant) executes in 1.167 seconds. This should be roughly
equivalent to 587 ms on a 330 MHz PII, or 8.17 times faster
than kForth.

In rewriting the benchmark for iForth 2.0 I got 491 ms
elapsed on the P54c (19.4 times faster than kForth after the
same clock scaling). This involved use of FLOCALS, LOCALS,
VALUE, FVALUE and inlining of all sv addresses.

-marcel

---      
FORTH> in C:\WINNT\Profiles\Administrator\Desktop\slbench.frt
Redefining tab
===================================================
Symbol  Parameter                       Value
===================================================

 t_p    Photon lifetime  (s):           4.500000E-12
 t_s    Carrier lifetime (s):           7.000000E-10
 G_N    Differential gain (cm^3/s):     0.000003
 N_th   Thr. carrier density (cm^-3):   1.500000E18
 I_th   Thr. current (mA):              20.000000
 alpha  Linewidth enhancement factor:   5.000000
===================================================
Derived Dimensionless Parameters
===================================================

t_s/t_p ratio:  155.555556
Pump factor:    8.775000
===================================================

1.167 seconds elapsed. ok
FORTH> 1167 166 330 */ . 587  ok
FORTH> 4800 100 587 */ . 817  ok
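
The scaling above can be re-checked in floating point, which avoids the
truncation that */ performs on integers. This is a minimal sketch, assuming
a Forth with a separate floating-point stack (gforth, pfe); scale-time and
speedup are hypothetical helper names, not part of the benchmark.

```forth
\ Scale a run time measured at clock rate f-from to clock rate f-to,
\ assuming run time is inversely proportional to clock rate.
: scale-time ( F: t f-from f-to -- t' )  F/ F* ;

\ Express a reference time as a multiple of another time.
: speedup ( F: t-ref t -- ratio )  F/ ;

\ Times in ms: 4800 is the kForth result on the 330 MHz machine.
1167e 166e 330e scale-time  4800e FSWAP speedup F.   \ gForth variant under iForth
 491e 166e 330e scale-time  4800e FSWAP speedup F.   \ iForth-specific rewrite
```

The two lines should print roughly 8.2 and 19.4, in line with the figures
quoted for the gForth variant and for the iForth rewrite.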



Tue, 05 Apr 2005 06:37:57 GMT  

Quote:


> > Available is a benchmark program based on an ordinary
> > differential equation solver written originally for
> > kForth. There are versions for pfe and gforth as well.
> > The kForth version assumes an integrated fp/data stack,
> > while the pfe and gforth versions assume a separate
> > fp stack. Following are the results on my 330 MHz PII:

> > kforth          4.80 s
> > pfe             3.09 s
> > gforth          2.04 s
> > gforth-fast     1.82 s

> [..]

> On my old P54c 166 MHz machine your benchmark (the gForth
> variant) executes in 1.167 seconds. This should be roughly
> equivalent to 587 ms on a 330 MHz PII, or 8.17 times faster
> than kForth.

Thanks Marcel. Was your benchmark of 1.167 sec on a P5 166 MHz
machine obtained with iForth? Which version? The source code
for the gForth variant is different from the pfe variant only
in that pfe requires a statement to load the floating point module.

By the way, I forgot to mention the versions of the different
Forths I used with the 330 MHz PII, running SuSE Linux 8.0:

kForth  v 1.0.13
pfe     v 0.32.91
gforth  v 0.5.0

The benchmark versions of this code do not produce any output --
the output statements in the word "main" have been commented
out. I have however verified that the three Forths listed above
produce the same numeric output with their respective code. If
you wish to see the output, uncomment the F. statements and
remove the preceding FDROP in "main".

Krishna



Tue, 05 Apr 2005 08:30:15 GMT  

[..]

Quote:
>> On my old P54c 166 MHz machine your benchmark (the gForth
>> variant) executes in 1.167 seconds. This should be roughly
>> equivalent to 587 ms on a 330 MHz PII, or 8.17 times faster
>> than kForth.

> Thanks Marcel. Was your benchmark of 1.167 sec on a P5 166 MHz
> machine obtained with iForth? Which version? The source code
> for the gForth variant is different from the pfe variant only
> in that pfe requires a statement to load the floating point module.

All numbers obtained with iForth 2.0.3.



BTW, on a Dell Optiplex GX 110 (833 MHz Intel PIII), the (iForth
specific) benchmark runs in 45 ms.

-marcel



Tue, 05 Apr 2005 16:30:10 GMT  


Technically you're correct --- I use dfloats everywhere.
As a practical matter, for which Forths would this pose


Another data point: bigforth v 2.0.9 executes the benchmark
in 0.48 s on my 330 MHz PII.

Krishna



Tue, 05 Apr 2005 20:35:45 GMT  

Quote:




> Technically you're correct --- I use dfloats everywhere.
> As a practical matter, for which Forths would this pose




all options that can differ depending on which binary you load).
However, I have not found any good reason yet for using 80 or 32
bits by default.


As the FP-library is a loadable option (a text file) there is a
slightly higher chance that somebody gets in trouble with your code.

-marcel



Tue, 05 Apr 2005 20:52:59 GMT  

Quote:




> Technically you're correct --- I use dfloats everywhere.
> As a practical matter, for which Forths would this pose





numbers. For example, my hardware uses 9 and 14 bytes BCD formats (yes,


 Cheers,

 Sebastien.



Tue, 05 Apr 2005 22:48:51 GMT  

Quote:


> > Technically you're correct --- I use dfloats everywhere.
> > As a practical matter, for which Forths would this pose




> numbers. For example, my hardware uses 9 and 14 bytes BCD formats (yes,


>  Cheers,

>  Sebastien.

Which Forth system are you using on this hardware?

are the precise words to use when working with DFLOAT
size fp numbers. But, I have gotten in the (lazy) habit of

of the systems I use. This may well come back to haunt me
someday!

Krishna
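
A short sketch of the distinction under discussion, assuming a system that
provides the optional floating-point extension words (DFLOATS, DFALIGN, DF@,
DF!, SFLOATS), as gforth does. The names vec and vec[] are hypothetical.
DF@ and DF! always read and write the 64-bit IEEE format in memory, while
F@ and F! use whatever size the system's default float has, so the two pairs
are interchangeable only where 1 FLOATS equals 1 DFLOATS.

```forth
\ Storage sizes are system-dependent; print them rather than assume them.
1 FLOATS  .      \ bytes occupied by the system's default float
1 DFLOATS .      \ bytes occupied by a 64-bit IEEE double
1 SFLOATS .      \ bytes occupied by a 32-bit IEEE single

\ A three-element buffer laid out explicitly as IEEE doubles.
DFALIGN HERE 3 DFLOATS ALLOT CONSTANT vec

\ Address of element n of the buffer.
: vec[] ( n -- df-addr )  DFLOATS vec + ;

1.5e 0 vec[] DF!     \ stored as a 64-bit double whatever 1 FLOATS is
0 vec[] DF@ F.       \ fetched back onto the floating-point stack
```

Code written this way keeps its memory layout on a system whose default
float is 80 or 32 bits, which is the portability concern raised in the
thread.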



Wed, 06 Apr 2005 05:46:01 GMT  

[..]

Quote:
>>> On my old P54c 166 MHz machine your benchmark (the gForth
>>> variant) executes in 1.167 seconds. This should be roughly
>>> equivalent to 587 ms on a 330 MHz PII, [...]

[..]

Here is slbench with routines from the FSL and a plotter-type
visualization. I'm using the variable step Runge-Kutta ODE solver
with an eps of 1e-3. Instead of 1.167 seconds it now takes 15 ms
(on my P54c) to get the (nice-looking) output.

-marcel
-- -----------
ANEW -slbench3

NEEDS -rk4
NEEDS -sketcher

DOC
(*
 Benchmark version of sl for iForth

 Solve the semiconductor laser rate equations, given a current pulse
 profile. Output the time, intensity, phase, and current density.

 Krishna Myneni, 1-26-2000
*)
ENDDOC

\ ================
\ Laser parameters
\ ================

4.5e-12 FVALUE t_p              \ photon lifetime (sec)
700e-12 FVALUE t_s              \ carrier lifetime (sec)
2.6e-6  FVALUE G_N              \ differential gain (cm^3/s)
1.5e18  FVALUE N_th             \ threshold carrier density (cm^-3)
20e-3   FVALUE I_th             \ threshold current through laser (A); 20 mA
5e      FVALUE alpha            \ linewidth enhancement factor

\ ========================
\ Dimensionless parameters
\ ========================

0e FVALUE T_ratio               \ T_ratio = t_s/t_p
0e FVALUE PumpFactor            \ P = PumpFactor*(I/I_th - 1)
0e FVALUE P

\ compute the normalized parameters
: init_params ( -- )
        t_s t_p F/              TO T_ratio
        t_p G_N F* N_th F* F2/  TO PumpFactor ;

\ =============================
\ The injection current profile
\ =============================

1e-9          FVALUE fwhm               \ full-width at half-max for
                                        \ current pulse in ns (1 ns)
20e-3         FVALUE pulse_amp          \ current pulse amplitude
                                        \ above d.c. level (20 mA)
I_th 10e-3 F+ FVALUE dc_current         \ d.c. current level (10 mA
                                        \ above threshold)
50e-9         FVALUE peak_offset        \ offset for current peak (50 ns)

\ compute current at real time t
:INLINE GaussianPulse ( F: t -- c )
        peak_offset F-
        fwhm F/
        FSQR -2.77066e F* FEXP
        pulse_amp F* dc_current F+ ;

\ =============================
\ Pump rate
\ =============================

\ P = PumpFactor*(I/I_th - 1);
:INLINE pump ( F: c -- )
        I_th F/ F1-
        PumpFactor F* TO P ;

\ ======================================
\ Rates of dimensionless state variables
\ ======================================

\ Data in a state vector has the following order: Re{Y}, Im{Y}, Z

0 VALUE 's
0 VALUE 't

:INLINE s[]  's []DFLOAT DF@ ;  \ assumed definition; this line is missing from the archived post
:INLINE t[]! 't []DFLOAT DF! ;

:INLINE intensity^2 ( -- ) ( F: -- |Y|^2 )  \ compute |Y|^2 = Re{Y}^2 + Im{Y}^2
        0 s[] FSQR  1 s[] FSQR F+ ;

:INLINE intensity ( -- ) ( F: -- I )    \ compute intensity
        intensity^2 FSQRT ;

:INLINE phase ( -- ) ( F: -- phase )    \ compute phase in radians
        1 s[]  0 s[]  FATAN2 ;

\ compute dY/ds for state vector 'a'
\ Re{dY/ds} = Z(Re{Y} - alpha*Im{Y})
\ Im{dY/ds} = Z(alpha*Re{Y} + Im{Y})
: dY/ds ( F: -- )
        0 s[]           1 s[] alpha F*  F-  2 s[] F*  0 t[]!
        0 s[] alpha F*  1 s[]           F+  2 s[] F*  1 t[]! ;

\ compute rate of change of Z
\ dZ/ds = (P - Z - (1 + 2Z)|Y|^2)/T
: dZ/ds ( F: -- )
        intensity^2 2 s[] F2* F1+  F*  
        P  2 s[] F- FSWAP F-  T_ratio F/  2 t[]! ;

\ derivative of the state vector
:NONAME ( u{ dudt{ -- ) ( F: t -- )
        TO 't  TO 's  GaussianPulse pump  dY/ds dZ/ds ; =: 'dsdt

\ ==========
\ ODE Solver
\ ==========

1e FVALUE ds
25000e t_p F* 1e-9 F/  FVALUE endtime

0 VALUE ix

   3 DOUBLE ARRAY V{
#512 DOUBLE ARRAY t{
#512 DOUBLE ARRAY i{    -- intensity
#512 DOUBLE ARRAY p{    -- phase
#512 DOUBLE ARRAY c{    -- current density

: main ( -- )
        CLEAR ix
        init_params
        2e FSQRT V{ 0 } DF!
        0e       V{ 1 } DF!
        0e       V{ 2 } DF!
        'dsdt 3 V{ ds 1e-3 )rk4qc_init
        ds 0e
        BEGIN
            rk4qc_step 0= IF CR ." Time: " F.N1
                                ." s, step: " F.N1
                                ." s, ix = "  ix DEC.
                             TRUE ABORT" Time step too small"  
                       ENDIF
            ix #512 < IF    FDUP      t{ ix } DF!
                            intensity i{ ix } DF!
                            phase     p{ ix } DF!
                            2 s[]     c{ ix } DF!
                   ENDIF
            1 +TO ix  FDUP endtime F>
        UNTIL
        F2DROP rk4qc_done
        CR ." needed " ix DEC. ." steps. " ;






CR TIMER-RESET main .ELAPSED

\ EOF



Wed, 06 Apr 2005 23:22:39 GMT  

Quote:

> Here is slbench with routines from the FSL and a plotter-type
> visualization. I'm using the variable step Runge-Kutta ODE solver
> with an eps of 1e-3. Instead of 1.167 seconds it now takes 15 ms
> (on my P54c) to get the (nice-looking) output.

Really nice work! Have you checked its output
against that of the original program? One of the
features of these equations is the "relaxation
oscillations" which occur whenever the drive current
changes abruptly, such as at t=0. These oscillations,
which show up in the computed intensity, are a severe
problem for achieving high bit-rate fiber optics
communications systems.

I would probably make use of the FSL routines more
if a version of the routines for an integrated fp/data
stack were available. Also, I think it would be very
useful to have some "real world" programming examples
that make use of the FSL routines, such as your version
of slbench!

Cheers,
Krishna



Thu, 07 Apr 2005 05:18:12 GMT  

Quote:

>> Here is slbench with routines from the FSL and a plotter-type
>> visualization. I'm using the variable step Runge-Kutta ODE solver
>> with an eps of 1e-3. Instead of 1.167 seconds it now takes 15 ms
>> (on my P54c) to get the (nice-looking) output.
>                   Have you checked its output
> against that of the original program? One of the
> features of these equations is the "relaxation
> oscillations" which occur whenever the drive current
> changes abruptly, such as at t=0. These oscillations,
> which show up in the computed intensity, are a severe
> problem for achieving high bit-rate fiber optics
> communications systems.

[ This question caused a lot of work -- I had to write something
  to plot and compare the original and the new versions. This
  was hard because the normalization of the differential
  equations makes interpretation of the 'time-axis' difficult. ]

The outputs are not equal. At t = 0 both are oscillatory, but
between t = 47.5 ns and t = 52.5 ns the new routine (with the
quality-controlled step Runge-Kutta) is smooth where the original
slbench oscillates visibly (when using a fixed ds = 0.1).
Taking a closer look at the original slbench, I note that
for ds = 0.5 the output is almost OK but for lower ds oscillations
start. For ds < 0.1 the output starts to explode and is clearly
wrong.

The new slbench can not be made to oscillate like this (I decreased
eps from 1e-3 to 1e-6).

There is indeed something funny about these laser equations. Even
in the visibly straight parts of the curve there is a very low level
oscillation which prevents rk4qc_step's size from going up. The final
code (25 s) is therefore _vastly_ slower than the original slbench
(0.3 s). But I guess it is much more important that its results
are correct :-)
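
The difference between a fixed ds and a quality-controlled step can be
illustrated with step doubling: take one RK4 step of size h and two steps of
size h/2, and use the difference as the error estimate that decides whether
h is acceptable. The sketch below assumes that scheme (FSL's rk4qc_step is
not reproduced here), a Forth with a separate floating-point stack, and a
stand-in equation y' = -y in place of the laser equations. Every name in it
is hypothetical.

```forth
FVARIABLE y0   FVARIABLE dt
FVARIABLE k1   FVARIABLE k2   FVARIABLE k3   FVARIABLE k4

\ Stand-in right-hand side: y' = -y.
: deriv ( F: y -- dy/dt )  FNEGATE ;

\ Slope evaluated at y0 + frac*dt*k, used for the intermediate stages.
: stage ( F: k frac -- slope )  F* dt F@ F*  y0 F@ F+  deriv ;

\ One classical fourth-order Runge-Kutta step of size h.
: rk4 ( F: y h -- y' )
   dt F!  y0 F!
   y0 F@ deriv        k1 F!
   k1 F@ 0.5e stage   k2 F!
   k2 F@ 0.5e stage   k3 F!
   k3 F@ 1e   stage   k4 F!
   k1 F@  k2 F@ 2e F* F+  k3 F@ 2e F* F+  k4 F@ F+
   dt F@ F* 6e F/  y0 F@ F+ ;

FVARIABLE ya   FVARIABLE ha

\ Step-doubling error estimate: |one full step - two half steps|.
: step-error ( F: y h -- err )
   ha F!  ya F!
   ya F@ ha F@ rk4               \ one step of size h
   ya F@ ha F@ 2e F/ rk4         \ first half step
         ha F@ 2e F/ rk4         \ second half step
   F- FABS ;

1e 0.1e step-error F.    \ small estimate: this step size is adequate
1e 2e   step-error F.    \ much larger estimate: the step should be reduced
```

A step-size controller compares this estimate with eps and shrinks or grows
h accordingly, which is why a persistent low-level oscillation in the
solution keeps the step from growing.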

Quote:
> I would probably make use of the FSL routines more
> if a version of the routines for an integrated fp/data
> stack were available.

As your own testing has shown, an integrated fp/data stack is quite
slow. Adding a software fp stack to an interpreter is trivial and would
certainly not make it much slower (maybe even faster and certainly
it will result in much easier to read and maintain user code!)

Quote:
> Also, I think it would be very
> useful to have some "real world" programming examples
> that make use of the FSL routines, such as your version
> of slbench!

I think you will find there are quite a lot of simple, but not trivial,
examples in the ODE related parts of the FSL. Most 'real-world' examples
would be much too large and complicated in that context.

-marcel



Fri, 08 Apr 2005 07:02:03 GMT  

Quote:



> >> Here is slbench with routines from the FSL and a plotter-type
> >> visualization. I'm using the variable step Runge-Kutta ODE solver
> >> with an eps of 1e-3. Instead of 1.167 seconds it now takes 15 ms
> >> (on my P54c) to get the (nice-looking) output.

> >                   Have you checked its output
> > against that of the original program? One of the
> > features of these equations is the "relaxation
> > oscillations" which occur whenever the drive current
> > changes abruptly, such as at t=0. These oscillations,
> > which show up in the computed intensity, are a severe
> > problem for achieving high bit-rate fiber optics
> > communications systems.

> [ This question caused a lot of work -- I had to write something
>   to plot and compare the original and the new versions. This
>   was hard because the normalization of the differential
>   equations makes interpretation of the 'time-axis' difficult. ]

Sorry it caused you so much trouble. But, a rule of thumb from
years of experience is: never believe the initial output of a
modeling program!

Quote:
> The outputs are not equal. At t = 0 both are oscillatory, but
> between t = 47.5 ns and t = 52.5 ns the new routine (with the
> quality-controlled step Runge-Kutta) is smooth where the original
> slbench oscillates visibly (when using a fixed ds = 0.1).
> Taking a closer look at the original slbench, I note that
> for ds = 0.5 the output is almost OK but for lower ds oscillations
> start. For ds < 0.1 the output starts to explode and is clearly
> wrong.

There should be oscillations in the output intensity for the
Gaussian current pulse that occurs between about 48 to 58 ns
into the simulation. The strength of these damped oscillations
which trail the Gaussian portion depends on the parameters used
in the simulation, but they are significant in strength for the
default params. I would not expect the simulation to go bad if
the (normalized) time step, ds, is set below 0.1 --- strange...
I have to check that.

Quote:
> The new slbench can not be made to oscillate like this (I decreased
> eps from 1e-3 to 1e-6).

> There is indeed something funny about these laser equations. Even
> in the visibly straight parts of the curve there is a very low level
> oscillation which prevents rk4qc_step's size from going up. The final
> code (25 s) is therefore _vastly_ slower than the original slbench
> (0.3 s). But I guess it is much more important that its results
> are correct :-)

Amen to your last statement!

Quote:
> > I would probably make use of the FSL routines more
> > if a version of the routines for an integrated fp/data
> > stack were available.

> As your own testing has shown, an integrated fp/data stack is quite
> slow. Adding a software fp stack to an interpreter is trivial and would
> certainly not make it much slower (maybe even faster and certainly
> it will result in much easier to read and maintain user code!)

Actually, my testing showed the opposite to be true for interpreters.
The performance for the integrated fp/data stack was respectably close
to the performance of code using a separate fp stack. The next version
of pfe will show this more clearly because it will provide a module
for an integrated stack. In early testing, I find less than 30%
difference. The comparisons with gforth, when normalized to the
performance ratio for non-fp code, also bear out the conclusion
that an integrated fp/data stack does not severely degrade performance
for floating point calculations. My experience has been that the
time to fetch and store fp numbers from memory to the fpu stack,
for x86 processors, is the main bottleneck for fp performance. Of
course, I'm only talking about interpreters here. For native code systems,
such as iForth, I expect that making use of the fpu stack will
improve performance considerably.

Cheers, ... and get some sleep!

Krishna



Fri, 08 Apr 2005 09:09:12 GMT  


Quote:

> As your own testing has shown, an integrated fp/data stack is quite
> slow. Adding a software fp stack to an interpreter is trivial and would
> certainly not make it much slower (maybe even faster and certainly
> it will result in much easier to read and maintain user code!)

    At least on a PC, it is slower, but not a lot.  I don't recall the exact
figures from when I moved the stack to memory, but it was something like 40%.

--

-Gary Chanson (MVP for Windows SDK)
-Software Consultant (Embedded systems and Real Time Controls)

-War is the last resort of the incompetent.



Fri, 08 Apr 2005 12:50:00 GMT  


Quote:

> As your own testing has shown, an integrated fp/data stack is quite
> slow. Adding a software fp stack to an interpreter is trivial and would
> certainly not make it much slower (maybe even faster and certainly
> it will result in much easier to read and maintain user code!)

    At least on a PC, it is slower, but not a lot.  I don't remember the
exact figure but when I moved my FP stack to memory, it slowed down by
about 40% as I recall.

--

-Gary Chanson (MVP for Windows SDK)
-Software Consultant (Embedded systems and Real Time Controls)

-War is the last resort of the incompetent.



Fri, 08 Apr 2005 12:22:03 GMT  

Quote:



> > As your own testing has shown, an integrated fp/data stack is quite
> > slow. Adding a software fp stack to an interpreter is trivial and would
> > certainly not make it much slower (maybe even faster and certainly
> > it will result in much easier to read and maintain user code!)

>     At least on a PC, it is slower, but not a lot.  I don't remember the
> exact figure but when I moved my FP stack to memory, it slowed down by
> about 40% as I recall.

What exactly are you comparing here? Whether on the normal data stack
or on a special *software* fp stack, the numbers are in memory (on a
PC). Of course, if operators like fswap, fover etc. aren't optimized
to the same degree as 2swap, 2over etc., there is an obvious
slowdown.

-marcel



Fri, 08 Apr 2005 16:01:28 GMT  
 