Possible F77 Code Improvement ?? 
Author Message
 Possible F77 Code Improvement ??

Hello;

I would very much appreciate your help.

I've developed a complex F77 code (with limited F90)  in a piece-wise
fashion during the last ten years or so, and still working on it
whenever my time allows.
The package (compiled with g95) runs quite slow!  It currently takes
hours on a PC 3.16 GHz m/c, but the program works perfectly (until I
find the last bug!)

The package contains 120 - 150 subroutines, one of them is Sub
EFFECT5, which is called 1000s of times by other routines in the
package.

An abbreviated version of Subroutine EFFECT5 is listed below.  It has
3 main loops: 15, 25, 35.
It's reasonable to assume that a code improvement of the sub,
regardless of how insignificant it is, would have an impact on
speeding up the entire package!

You experts would most certainly see many "obvious" ways for
improvement !! ... delete statements, combine expressions, delete
variables, replace the 3 loops, change declarations, etc.

Your suggestions would be greatly appreciated.

Thank you kindly.
Monir

C
**********************************************************************
   SUBROUTINE EFFECT5
C
**********************************************************************
 EXTERNAL FUNC5R, MIDPNT, MIDINF

 REAL    X(62,8), R(62,8), TH(62,8)
 REAL    GAM(62), TANBIN(62)

 double precision ValX(496), ValR(496), ValT(496)
 double precision ValAbsX(496), ValAbsR(496), ValAbsT(496)
 double precision SSX, SSR, SST
 double precision SUMX, SUMR, SUMT
 double precision DX(1080), DR(1080), DT(1080)

 COMMON /DATA02/ DX, DR, DT
 COMMON /DATA03/ X, R, TH
 COMMON /DATA04/ GAM, TANBIN
 COMMON /DATA33/ NPOINTS, XPOINT, RPOINT, THPOINT
 COMMON /DATA55/ IPOINT, JVORT, NB

C IPOINT is the index of point of evaluation in the incompressible
steady rotational flow

 DO 15 IPOINT = 1, NPOINTS  !max 1080
C.......my code 1

     Kount = 0
     DO 35 NB= 1, NB !max 8

           DO 25 JVORT= 1, NF  !max 62
                 Kount = Kount + 1
C.......my code 2
                SSX= valueXX
                SSR= valueRR
                SST= valueTT
C.......my code 3
           ValX(Kount) = SSX*GAM(JVORT)*R(JVORT,1)
                  ValAbsX(Kount)=ABS(ValX(Kount))
           ValR(Kount) = SSR*GAM(JVORT)*R(JVORT,1)
                  ValAbsR(Kount)=ABS(ValR(Kount))
           ValT(Kount) = SST*GAM(JVORT)*R(JVORT,1)
                  ValAbsT(Kount)=ABS(ValT(Kount))

25        CONTINUE

35     CONTINUE

C in order to minimize the round-off errors, sort the absolute values
in
C ascending order, and then sum the actual values in the new order
C
 call sort2d (kount, valAbsX, ValX)
 call sort2d (kount, valAbsR, ValR)
 call sort2d (kount, valAbsT, ValT)

      SUMX = 0.D0
      SUMR = 0.D0
      SUMT = 0.D0
      DO 555 Index = 1, kount
                SUMX = SUMX + ValX(Index)
                SUMR = SUMR + ValR(Index)
                SUMT = SUMT + ValT(Index)
555          CONTINUE

        DX(IPOINT) = 0.5*SUMX
        DR(IPOINT) = 0.5*SUMR
        DT(IPOINT) = 0.5*SUMT

15  CONTINUE     !outermost loop for ipoint

 Return
 END

C *******************************************************************
   SUBROUTINE SORT2D(N, RA, RB)
C *******************************************************************
C
 double precision RA(N), RB(N)

C.......my code 4

 Return
 END



Mon, 22 Aug 2011 05:42:38 GMT  
 Possible F77 Code Improvement ??

Quote:

> Hello;

> I would very much appreciate your help.

> I've developed a complex F77 code (with limited F90)  in a piece-wise
> fashion during the last ten years or so, and still working on it
> whenever my time allows.
> The package (compiled with g95) runs quite slow!  It currently takes
> hours on a PC 3.16 GHz m/c, but the program works perfectly (until I
> find the last bug!)

> The package contains 120 - 150 subroutines, one of them is Sub
> EFFECT5, which is called 1000s of times by other routines in the
> package.

> An abbreviated version of Subroutine EFFECT5 is listed below.  It has
> 3 main loops: 15, 25, 35.
> It's reasonable to assume that a code improvement of the sub,
> regardless of how insignificant it is, would have an impact on
> speeding up the entire package!

> You experts would most certainly see many "obvious" ways for
> improvement !! ... delete statements, combine expressions, delete
> variables, replace the 3 loops, change declarations, etc.

> Your suggestions would be greatly appreciated.

...

Quote:
> C in order to minimize the round-off errors, sort the absolute values
> in
> C ascending order, and then sum the actual values in the new order
> C
> call sort2d (kount, valAbsX, ValX)
> call sort2d (kount, valAbsR, ValR)
> call sort2d (kount, valAbsT, ValT)

A quick look makes me think that this is where your code is spending
most of its time.  Try commenting these lines out and see how it
affects the runtime (and the accuracy).

Cheers,
Rob Komar



Mon, 22 Aug 2011 06:20:00 GMT  
 Possible F77 Code Improvement ??

Quote:

> Hello;

> I would very much appreciate your help.

> I've developed a complex F77 code (with limited F90)  in a piece-wise
> fashion during the last ten years or so, and still working on it
> whenever my time allows.
> The package (compiled with g95) runs quite slow!  It currently takes
> hours on a PC 3.16 GHz m/c, but the program works perfectly (until I
> find the last bug!)

> The package contains 120 - 150 subroutines, one of them is Sub
> EFFECT5, which is called 1000s of times by other routines in the
> package.

> An abbreviated version of Subroutine EFFECT5 is listed below.  It has
> 3 main loops: 15, 25, 35.
> It's reasonable to assume that a code improvement of the sub,
> regardless of how insignificant it is, would have an impact on
> speeding up the entire package!

> You experts would most certainly see many "obvious" ways for
> improvement !! ... delete statements, combine expressions, delete
> variables, replace the 3 loops, change declarations, etc.

> Your suggestions would be greatly appreciated.

A couple of suggestions.  Are you sure EFFECT5 is the big
time consumer?  Have you tried some sort of timing profiler?
Also, have you tried various optimization levels?  I know it
sounds silly, but sometimes when people call tech support
the answer is "plug in the computer."

The only other things that jump out are the summation loop, 555.
If you can do some F95 stuff, try using 3 SUM intrinsics, rather
than the explicit summation loop.  Some times the compiler can
generate better inline code for a SUM.  Other times, doing all
3 in one loop is more efficient; depends on cache hits, instruction
scheduling, number of registers, and black magic.

In real life, I'd also worry about the sort2d routine.  This
version is pretty efficient, but sorting is hard and when you
actually do it you could be spending a lot of time in sort2d.
I'm guessing you're sorting be the absolute value and
carrying along the actual value.  You might try eliminating the
3 ValAbs* arrays and doing the ABS computation in sort2d
on the fly.  It's likely to add tons of absolute value
computations in sort2d.  But, they are likely to be very
fast and well scheduled.  If so, you might eliminate cache
collisions and also make the DO 35 loop more efficient
just because it will use less registers.  You could also
just move the computations of ValAbs* out of the DO 35
loop and put in set of
       ValAbs* = ABS(Val*)
after the loop.  On the chance that the compiler will do
a much more efficient job on the simple intrinsics.

You could also try something called Kahan's method for
doing a summation.  I don't remember how to do it, but
(I think) it's pretty good at summing unsorted values
and controlling roundoff.  That would also eliminate
the need for the ValAbs* arrays and the sort2d calls.

Dick Hendrickson

- Show quoted text -

Quote:

> Thank you kindly.
> Monir

> C
> **********************************************************************
>    SUBROUTINE EFFECT5
> C
> **********************************************************************
>  EXTERNAL FUNC5R, MIDPNT, MIDINF

>  REAL    X(62,8), R(62,8), TH(62,8)
>  REAL    GAM(62), TANBIN(62)

>  double precision ValX(496), ValR(496), ValT(496)
>  double precision ValAbsX(496), ValAbsR(496), ValAbsT(496)
>  double precision SSX, SSR, SST
>  double precision SUMX, SUMR, SUMT
>  double precision DX(1080), DR(1080), DT(1080)

>  COMMON /DATA02/ DX, DR, DT
>  COMMON /DATA03/ X, R, TH
>  COMMON /DATA04/ GAM, TANBIN
>  COMMON /DATA33/ NPOINTS, XPOINT, RPOINT, THPOINT
>  COMMON /DATA55/ IPOINT, JVORT, NB

> C IPOINT is the index of point of evaluation in the incompressible
> steady rotational flow

>  DO 15 IPOINT = 1, NPOINTS  !max 1080
> C.......my code 1

>      Kount = 0
>      DO 35 NB= 1, NB !max 8

>            DO 25 JVORT= 1, NF  !max 62
>                  Kount = Kount + 1
> C.......my code 2
>                 SSX= valueXX
>                 SSR= valueRR
>                 SST= valueTT
> C.......my code 3
>            ValX(Kount) = SSX*GAM(JVORT)*R(JVORT,1)
>                   ValAbsX(Kount)=ABS(ValX(Kount))
>            ValR(Kount) = SSR*GAM(JVORT)*R(JVORT,1)
>                   ValAbsR(Kount)=ABS(ValR(Kount))
>            ValT(Kount) = SST*GAM(JVORT)*R(JVORT,1)
>                   ValAbsT(Kount)=ABS(ValT(Kount))

> 25        CONTINUE

> 35     CONTINUE

> C in order to minimize the round-off errors, sort the absolute values
> in
> C ascending order, and then sum the actual values in the new order
> C
>  call sort2d (kount, valAbsX, ValX)
>  call sort2d (kount, valAbsR, ValR)
>  call sort2d (kount, valAbsT, ValT)

>       SUMX = 0.D0
>       SUMR = 0.D0
>       SUMT = 0.D0
>       DO 555 Index = 1, kount
>                 SUMX = SUMX + ValX(Index)
>                 SUMR = SUMR + ValR(Index)
>                 SUMT = SUMT + ValT(Index)
> 555          CONTINUE

>         DX(IPOINT) = 0.5*SUMX
>         DR(IPOINT) = 0.5*SUMR
>         DT(IPOINT) = 0.5*SUMT

> 15  CONTINUE     !outermost loop for ipoint

>  Return
>  END

> C *******************************************************************
>    SUBROUTINE SORT2D(N, RA, RB)
> C *******************************************************************
> C
>  double precision RA(N), RB(N)

> C.......my code 4

>  Return
>  END



Mon, 22 Aug 2011 06:37:08 GMT  
 Possible F77 Code Improvement ??
(snip)

Quote:
> in order to minimize the round-off errors, sort

 > the absolute values in ascending order, and then
 > sum the actual values in the new order

I know that with one sign it is best to sort and add.

I am not so sure about mixed signs.  I can easily imagine
that a large value of the opposite sign at the end wipes
out all the previously minimized round-off errors.

-- glen



Mon, 22 Aug 2011 06:46:37 GMT  
 Possible F77 Code Improvement ??

Quote:

>  DO 15 IPOINT = 1, NPOINTS  !max 1080
> C.......my code 1

>      Kount = 0
>      DO 35 NB= 1, NB !max 8

>            DO 25 JVORT= 1, NF  !max 62
>                  Kount = Kount + 1
> C.......my code 2
>                 SSX= valueXX
>                 SSR= valueRR
>                 SST= valueTT
> C.......my code 3
>            ValX(Kount) = SSX*GAM(JVORT)*R(JVORT,1)
>                   ValAbsX(Kount)=ABS(ValX(Kount))
>            ValR(Kount) = SSR*GAM(JVORT)*R(JVORT,1)
>                   ValAbsR(Kount)=ABS(ValR(Kount))
>            ValT(Kount) = SST*GAM(JVORT)*R(JVORT,1)
>                   ValAbsT(Kount)=ABS(ValT(Kount))

> 25        CONTINUE

This appears to be vectorizable, for most common CPUs.   If you use a
compiler which isn't capable of vectorization with multiple sum reductions
in one loop, it may be worth splitting it to gain vectorization.
Quote:

> 35     CONTINUE

> C in order to minimize the round-off errors, sort the absolute values
> in
> C ascending order, and then sum the actual values in the new order
> C
>  call sort2d (kount, valAbsX, ValX)
>  call sort2d (kount, valAbsR, ValR)
>  call sort2d (kount, valAbsT, ValT)

It would likely be faster to sum in x87 extra precision instead of
sorting, even though that prevents effective vectorization.

- Show quoted text -

Quote:
>       SUMX = 0.D0
>       SUMR = 0.D0
>       SUMT = 0.D0
>       DO 555 Index = 1, kount
>                 SUMX = SUMX + ValX(Index)
>                 SUMR = SUMR + ValR(Index)
>                 SUMT = SUMT + ValT(Index)
> 555          CONTINUE

>         DX(IPOINT) = 0.5*SUMX
>         DR(IPOINT) = 0.5*SUMR
>         DT(IPOINT) = 0.5*SUMT

> 15  CONTINUE     !outermost loop for ipoint



Mon, 22 Aug 2011 10:52:21 GMT  
 Possible F77 Code Improvement ??

Quote:

> > Hello;

> > I would very much appreciate your help.

> > I've developed a complex F77 code (with limited F90) ?in a piece-wise
> > fashion during the last ten years or so, and still working on it
> > whenever my time allows.
> > The package (compiled with g95) runs quite slow! ?It currently takes
> > hours on a PC 3.16 GHz m/c, but the program works perfectly (until I
> > find the last bug!)

> > The package contains 120 - 150 subroutines, one of them is Sub
> > EFFECT5, which is called 1000s of times by other routines in the
> > package.

> > An abbreviated version of Subroutine EFFECT5 is listed below. ?It has
> > 3 main loops: 15, 25, 35.
> > It's reasonable to assume that a code improvement of the sub,
> > regardless of how insignificant it is, would have an impact on
> > speeding up the entire package!

> > You experts would most certainly see many "obvious" ways for
> > improvement !! ... delete statements, combine expressions, delete
> > variables, replace the 3 loops, change declarations, etc.

> > Your suggestions would be greatly appreciated.

> ...

> > C in order to minimize the round-off errors, sort the absolute values
> > in
> > C ascending order, and then sum the actual values in the new order
> > C
> > call sort2d (kount, valAbsX, ValX)
> > call sort2d (kount, valAbsR, ValR)
> > call sort2d (kount, valAbsT, ValT)

> A quick look makes me think that this is where your code is spending
> most of its time. ?Try commenting these lines out and see how it
> affects the runtime (and the accuracy).

> Cheers,
> Rob Komar- Hide quoted text -

> - Show quoted text -

Rob;

Yes.  Commenting out Call sort2d lines positively affects the runtime
and adversely affects the accuracy.

Regards.
Monir



Mon, 22 Aug 2011 12:37:16 GMT  
 Possible F77 Code Improvement ??

Quote:

> > Hello;

> > I would very much appreciate your help.

> > I've developed a complex F77 code (with limited F90) ?in a piece-wise
> > fashion during the last ten years or so, and still working on it
> > whenever my time allows.
> > The package (compiled with g95) runs quite slow! ?It currently takes
> > hours on a PC 3.16 GHz m/c, but the program works perfectly (until I
> > find the last bug!)

> > The package contains 120 - 150 subroutines, one of them is Sub
> > EFFECT5, which is called 1000s of times by other routines in the
> > package.

> > An abbreviated version of Subroutine EFFECT5 is listed below. ?It has
> > 3 main loops: 15, 25, 35.
> > It's reasonable to assume that a code improvement of the sub,
> > regardless of how insignificant it is, would have an impact on
> > speeding up the entire package!

> > You experts would most certainly see many "obvious" ways for
> > improvement !! ... delete statements, combine expressions, delete
> > variables, replace the 3 loops, change declarations, etc.

> > Your suggestions would be greatly appreciated.

> A couple of suggestions. ?Are you sure EFFECT5 is the big
> time consumer? ?Have you tried some sort of timing profiler?
> Also, have you tried various optimization levels? ?I know it
> sounds silly, but sometimes when people call tech support
> the answer is "plug in the computer."

> The only other things that jump out are the summation loop, 555.
> If you can do some F95 stuff, try using 3 SUM intrinsics, rather
> than the explicit summation loop. ?Some times the compiler can
> generate better inline code for a SUM. ?Other times, doing all
> 3 in one loop is more efficient; depends on cache hits, instruction
> scheduling, number of registers, and black magic.

> In real life, I'd also worry about the sort2d routine. ?This
> version is pretty efficient, but sorting is hard and when you
> actually do it you could be spending a lot of time in sort2d.
> I'm guessing you're sorting be the absolute value and
> carrying along the actual value. ?You might try eliminating the
> 3 ValAbs* arrays and doing the ABS computation in sort2d
> on the fly. ?It's likely to add tons of absolute value
> computations in sort2d. ?But, they are likely to be very
> fast and well scheduled. ?If so, you might eliminate cache
> collisions and also make the DO 35 loop more efficient
> just because it will use less registers. ?You could also
> just move the computations of ValAbs* out of the DO 35
> loop and put in set of
> ? ? ? ?ValAbs* = ABS(Val*)
> after the loop. ?On the chance that the compiler will do
> a much more efficient job on the simple intrinsics.

> You could also try something called Kahan's method for
> doing a summation. ?I don't remember how to do it, but
> (I think) it's pretty good at summing unsorted values
> and controlling roundoff. ?That would also eliminate
> the need for the ValAbs* arrays and the sort2d calls.

>{*filter*} Hendrickson

> > Thank you kindly.
> > Monir

> > C
> > **********************************************************************
> > ? ?SUBROUTINE EFFECT5
> > C
> > **********************************************************************
> > ?EXTERNAL FUNC5R, MIDPNT, MIDINF

> > ?REAL ? ?X(62,8), R(62,8), TH(62,8)
> > ?REAL ? ?GAM(62), TANBIN(62)

> > ?double precision ValX(496), ValR(496), ValT(496)
> > ?double precision ValAbsX(496), ValAbsR(496), ValAbsT(496)
> > ?double precision SSX, SSR, SST
> > ?double precision SUMX, SUMR, SUMT
> > ?double precision DX(1080), DR(1080), DT(1080)

> > ?COMMON /DATA02/ DX, DR, DT
> > ?COMMON /DATA03/ X, R, TH
> > ?COMMON /DATA04/ GAM, TANBIN
> > ?COMMON /DATA33/ NPOINTS, XPOINT, RPOINT, THPOINT
> > ?COMMON /DATA55/ IPOINT, JVORT, NB

> > C IPOINT is the index of point of evaluation in the incompressible
> > steady rotational flow

> > ?DO 15 IPOINT = 1, NPOINTS ?!max 1080
> > C.......my code 1

> > ? ? ?Kount = 0
> > ? ? ?DO 35 NB= 1, NB !max 8

> > ? ? ? ? ? ?DO 25 JVORT= 1, NF ?!max 62
> > ? ? ? ? ? ? ? ? ?Kount = Kount + 1
> > C.......my code 2
> > ? ? ? ? ? ? ? ? SSX= valueXX
> > ? ? ? ? ? ? ? ? SSR= valueRR
> > ? ? ? ? ? ? ? ? SST= valueTT
> > C.......my code 3
> > ? ? ? ? ? ?ValX(Kount) = SSX*GAM(JVORT)*R(JVORT,1)
> > ? ? ? ? ? ? ? ? ? ValAbsX(Kount)=ABS(ValX(Kount))
> > ? ? ? ? ? ?ValR(Kount) = SSR*GAM(JVORT)*R(JVORT,1)
> > ? ? ? ? ? ? ? ? ? ValAbsR(Kount)=ABS(ValR(Kount))
> > ? ? ? ? ? ?ValT(Kount) = SST*GAM(JVORT)*R(JVORT,1)
> > ? ? ? ? ? ? ? ? ? ValAbsT(Kount)=ABS(ValT(Kount))

> > 25 ? ? ? ?CONTINUE

> > 35 ? ? CONTINUE

> > C in order to minimize the round-off errors, sort the absolute values
> > in
> > C ascending order, and then sum the actual values in the new order
> > C
> > ?call sort2d (kount, valAbsX, ValX)
> > ?call sort2d (kount, valAbsR, ValR)
> > ?call sort2d (kount, valAbsT, ValT)

> > ? ? ? SUMX = 0.D0
> > ? ? ? SUMR = 0.D0
> > ? ? ? SUMT = 0.D0
> > ? ? ? DO 555 Index = 1, kount
> > ? ? ? ? ? ? ? ? SUMX = SUMX + ValX(Index)
> > ? ? ? ? ? ? ? ? SUMR = SUMR + ValR(Index)
> > ? ? ? ? ? ? ? ? SUMT = SUMT + ValT(Index)
> > 555 ? ? ? ? ?CONTINUE

> > ? ? ? ? DX(IPOINT) = 0.5*SUMX
> > ? ? ? ? DR(IPOINT) = 0.5*SUMR
> > ? ? ? ? DT(IPOINT) = 0.5*SUMT

> > 15 ?CONTINUE ? ? !outermost loop for ipoint

> > ?Return
> > ?END

> > C *******************************************************************
> > ? ?SUBROUTINE SORT2D(N, RA, RB)
> > C *******************************************************************
> > C
> > ?double precision RA(N), RB(N)

> > C.......my code 4

> > ?Return
> > ?END- Hide quoted text -

> - Show quoted text -- Hide quoted text -

> - Show quoted text -

Dick;

Thank you for your prompt reply and helpful suggestions.

I'm sure EFFECT5 is the culprit.  No doubt its subsidiary routine
sort2d and the summation loop 555 consume a lot of time.
The values involved in each sum vary from extremely large positive/
negative values to extremely small positive/negative values.  In fact,
some of the values are "almost" equal with opposite sign, which can
result in serious loss of precision.

To achieve a reasonable level of precision following a series of
numerical experimentation, I decided a while back to sort the values
before the summation, and hence the routine sort2d and loop 555.  The
results correlate extremely well with other theories.

I've used F90 sporadically, but never used F95, so I'm not familiar,
for example, with the use of "fortran SUM Intrinsic Function", even
though I use the Intrinsic SUM in other programming languages (VBA).
I reckon for a such use one would have to take extra care in declaring
the exact arrays size, modify the common data blocks, adjust the
equivalence statements (nightmare!), etc. to comply with the F95
instructions.

You suggested the possibility of "... doing all 3 in one loop is more
efficient ..."
Would you be kind enough to elaborate on how ??  Maybe just couple of
code lines to get the idea.

I don't remember the "Kahan's method for summation".  Could it be that
you meant "Knuth" method ??
In any event, I suppose the method would require a routine to replace
sort2d.  Correct ??  Will check.

Thanks again for your time and help.
Monir



Mon, 22 Aug 2011 12:38:43 GMT  
 Possible F77 Code Improvement ??

Quote:

> (snip)

> > in order to minimize the round-off errors, sort

> ?> the absolute values in ascending order, and then
> ?> sum the actual values in the new order

> I know that with one sign it is best to sort and add.

> I am not so sure about mixed signs. ?I can easily imagine
> that a large value of the opposite sign at the end wipes
> out all the previously minimized round-off errors.

> -- glen

Glen;

I believe it is the subtraction of almost equal small values that can
seriously affect the precision of the calculations.
Routine sort2d is based on N.R. Heapsort method and has been tested
extensively in the summation of very large arrays of mixed elements
ranging from extremely large to extremely small values.

Regards.
Monir



Mon, 22 Aug 2011 12:40:19 GMT  
 Possible F77 Code Improvement ??
(snip)

Quote:
> I don't remember the "Kahan's method for summation". t
> Could it be tha you meant "Knuth" method ??

http://en.wikipedia.org/wiki/Kahan_summation_algorithm#The_algorithm

Quote:
> In any event, I suppose the method would require a
> routine to replace sort2d.  Correct ??  Will check.

http://en.wikipedia.org/wiki/Kahan_summation_algorithm#Alternatives

I think the way the suggested sorted sum method works for values
of both signs would be something like:

Sort the positive numbers.
Sort the negative numbers.

Start with the one with the smallest absolute value, and add
it to the sum.  Continue adding numbers from the positive list
or negative list, whichever keeps the absolute value of the
sum the smallest.

If there aren't any more in one of the lists, just add the
rest from the other list.

It isn't so obvious to me that sorting by absolute value
does that, but then it isn't obvious that it doesn't work,
either.

-- glen



Mon, 22 Aug 2011 13:24:26 GMT  
 Possible F77 Code Improvement ??

Quote:

> (snip)

> > I don't remember the "Kahan's method for summation". t
> > Could it be tha you meant "Knuth" method ??

> http://en.wikipedia.org/wiki/Kahan_summation_algorithm#The_algorithm

> > In any event, I suppose the method would require a
> > routine to replace sort2d. ?Correct ?? ?Will check.

> http://en.wikipedia.org/wiki/Kahan_summation_algorithm#Alternatives

> I think the way the suggested sorted sum method works for values
> of both signs would be something like:

> Sort the positive numbers.
> Sort the negative numbers.

> Start with the one with the smallest absolute value, and add
> it to the sum. ?Continue adding numbers from the positive list
> or negative list, whichever keeps the absolute value of the
> sum the smallest.

> If there aren't any more in one of the lists, just add the
> rest from the other list.

> It isn't so obvious to me that sorting by absolute value
> does that, but then it isn't obvious that it doesn't work,
> either.

> -- glen

Thanks Glen.  Will check both sites.

Regards.
Monir



Mon, 22 Aug 2011 13:39:08 GMT  
 Possible F77 Code Improvement ??

Quote:

> > ?DO 15 IPOINT = 1, NPOINTS ?!max 1080
> > C.......my code 1

> > ? ? ?Kount = 0
> > ? ? ?DO 35 NB= 1, NB !max 8

> > ? ? ? ? ? ?DO 25 JVORT= 1, NF ?!max 62
> > ? ? ? ? ? ? ? ? ?Kount = Kount + 1
> > C.......my code 2
> > ? ? ? ? ? ? ? ? SSX= valueXX
> > ? ? ? ? ? ? ? ? SSR= valueRR
> > ? ? ? ? ? ? ? ? SST= valueTT
> > C.......my code 3
> > ? ? ? ? ? ?ValX(Kount) = SSX*GAM(JVORT)*R(JVORT,1)
> > ? ? ? ? ? ? ? ? ? ValAbsX(Kount)=ABS(ValX(Kount))
> > ? ? ? ? ? ?ValR(Kount) = SSR*GAM(JVORT)*R(JVORT,1)
> > ? ? ? ? ? ? ? ? ? ValAbsR(Kount)=ABS(ValR(Kount))
> > ? ? ? ? ? ?ValT(Kount) = SST*GAM(JVORT)*R(JVORT,1)
> > ? ? ? ? ? ? ? ? ? ValAbsT(Kount)=ABS(ValT(Kount))

> > 25 ? ? ? ?CONTINUE

> This appears to be vectorizable, for most common CPUs. ? If you use a
> compiler which isn't capable of vectorization with multiple sum reductions
> in one loop, it may be worth splitting it to gain vectorization.

> > 35 ? ? CONTINUE

> > C in order to minimize the round-off errors, sort the absolute values
> > in
> > C ascending order, and then sum the actual values in the new order
> > C
> > ?call sort2d (kount, valAbsX, ValX)
> > ?call sort2d (kount, valAbsR, ValR)
> > ?call sort2d (kount, valAbsT, ValT)

> It would likely be faster to sum in x87 extra precision instead of
> sorting, even though that prevents effective vectorization.

> > ? ? ? SUMX = 0.D0
> > ? ? ? SUMR = 0.D0
> > ? ? ? SUMT = 0.D0
> > ? ? ? DO 555 Index = 1, kount
> > ? ? ? ? ? ? ? ? SUMX = SUMX + ValX(Index)
> > ? ? ? ? ? ? ? ? SUMR = SUMR + ValR(Index)
> > ? ? ? ? ? ? ? ? SUMT = SUMT + ValT(Index)
> > 555 ? ? ? ? ?CONTINUE

> > ? ? ? ? DX(IPOINT) = 0.5*SUMX
> > ? ? ? ? DR(IPOINT) = 0.5*SUMR
> > ? ? ? ? DT(IPOINT) = 0.5*SUMT

> > 15 ?CONTINUE ? ? !outermost loop for ipoint- Hide quoted text -

> - Show quoted text -- Hide quoted text -

> - Show quoted text -

Tim;

Quote:
>"It would likely be faster to sum in x87 extra precision instead of sorting"

Do you mean using higher than double precision ??
Can I do that with F77/g95 compiler ??

Regards.
Monir



Mon, 22 Aug 2011 13:48:23 GMT  
 Possible F77 Code Improvement ??

(someone wrote)

Quote:
>>"It would likely be faster to sum in x87 extra precision instead of sorting"
> Do you mean using higher than double precision ??
> Can I do that with F77/g95 compiler ??

Sometimes it happens naturally, whether you want it
or not.  Sometimes that is good, other times not.

-- glen



Mon, 22 Aug 2011 14:05:17 GMT  
 Possible F77 Code Improvement ??
What is this?

Quote:
>  COMMON /DATA55/ IPOINT, JVORT, NB
> ...
>      DO 35 NB= 1, NB !max 8

Using the upper limit of a DO loop for the loop index of the same loop?
If this is in your real code, then malfunction such as absurdly long
execution time would be expected.

--Dave



Mon, 22 Aug 2011 15:36:49 GMT  
 Possible F77 Code Improvement ??
I think the sort is what is hurting your time most.
Anothe may vbe using extra precision calculation in some place where
you don't need it
I'ma lso assuming the compiler will optimise some of your code that
has repeated terms.
.
I know why you do this sorting, but here's my main points.

1) If the results of the summations you are performing are expected to
be LARGE, it won't have much effect but if the result should be SMALL,
then DO NOT separate the positives and negative values before summing
because in this particular case you could LOSE precision unless you
peformed the summation with still double further precision!.
It's the old rule of not subtracting two large numbers

2) If you do a sort, do it the most efficient way possible.



Mon, 22 Aug 2011 16:36:51 GMT  
 Possible F77 Code Improvement ??

Quote:
> I'm sure EFFECT5 is the culprit.

You're "sure", as in "I've got profiling data", or "just because"
(which tends to be worthless)?

Quote:
>  No doubt

"No doubt"? I'm beginning to suspect you actually haven't profiled
your program, which is the first thing to do in any optimization
exercise. As an aside, modern sampling profilers can profile optimized
code with very low overhead.

Quote:
> its subsidiary routine
> sort2d and the summation loop 555 consume a lot of time.

You previously mentioned that sort2d uses heapsort, which is
O(N*log(N)), whereas the summation is O(N), so obviously for large N
the sorting time will dominate.

By using the Kahan summation algorithm, you might be able to get rid
of the sorting without suffering in accuracy. Or a cheaper way might
be to declare the sum accumulator as extended precision.

Quote:
> I've used F90 sporadically, but never used F95, so I'm not familiar,
> for example, with the use of "Fortran SUM Intrinsic Function", even
> though I use the Intrinsic SUM in other programming languages (VBA).
> I reckon for a such use one would have to take extra care in declaring
> the exact arrays size, modify the common data blocks, adjust the
> equivalence statements (nightmare!), etc. to comply with the F95
> instructions.

Maybe, if your program uses all kinds of nonstandard tricks. For
standard conforming code, the F95 SUM intrinsic should work out of the
box.

Quote:
> You suggested the possibility of "... doing all 3 in one loop is more
> efficient ..."
> Would you be kind enough to elaborate on how ??

Looping over one array at a time might give better cache behaviour. Or
then it that won't matter, and your fused loops are better due to
providing the CPU with more opportunities for parallelism. As
previously mentioned, profiling is the tool you can use to determine
which is better in your case.

--
JB



Mon, 22 Aug 2011 17:26:32 GMT  
 
 [ 136 post ]  Go to page: [1] [2] [3] [4] [5] [6] [7] [8] [9] [10]

 Relevant Pages 

1. Python 1.5b1, code review for possible performance improvements

2. code improvement (was:Re: Re: ANN: Code AmeliorationContest (presented by Ruby Conference 2001))

3. code improvement (was:Re: Re: ANN: Code Amelioration Contest (presented by Ruby Conference 2001))

4. Possible UNIX-like improvement in ISE Eiffel tool set

5. F2007 possible improvement

6. Is it possible to generate code (as C or C++ code) from a LabView diagram

7. Ruby Cookbook Improvements (was Ruby code: the lost generation)

8. Next amusing problem: talking integers (was Re: Code sample for improvement)

9. Code sample for improvement

10. Lame code needs improvement

11. RFC: improvements to 'glob' (code included)

 

 
Powered by phpBB® Forum Software