Possible F77 Code Improvement ??
Author 
Message 
moni #1 / 136

Possible F77 Code Improvement ??
Hello; I would very much appreciate your help. I've developed a complex F77 code (with limited F90) in a piecewise fashion during the last ten years or so, and still working on it whenever my time allows. The package (compiled with g95) runs quite slow! It currently takes hours on a PC 3.16 GHz m/c, but the program works perfectly (until I find the last bug!) The package contains 120  150 subroutines, one of them is Sub EFFECT5, which is called 1000s of times by other routines in the package. An abbreviated version of Subroutine EFFECT5 is listed below. It has 3 main loops: 15, 25, 35. It's reasonable to assume that a code improvement of the sub, regardless of how insignificant it is, would have an impact on speeding up the entire package! You experts would most certainly see many "obvious" ways for improvement !! ... delete statements, combine expressions, delete variables, replace the 3 loops, change declarations, etc. Your suggestions would be greatly appreciated. Thank you kindly. Monir C ********************************************************************** SUBROUTINE EFFECT5 C ********************************************************************** EXTERNAL FUNC5R, MIDPNT, MIDINF REAL X(62,8), R(62,8), TH(62,8) REAL GAM(62), TANBIN(62) double precision ValX(496), ValR(496), ValT(496) double precision ValAbsX(496), ValAbsR(496), ValAbsT(496) double precision SSX, SSR, SST double precision SUMX, SUMR, SUMT double precision DX(1080), DR(1080), DT(1080) COMMON /DATA02/ DX, DR, DT COMMON /DATA03/ X, R, TH COMMON /DATA04/ GAM, TANBIN COMMON /DATA33/ NPOINTS, XPOINT, RPOINT, THPOINT COMMON /DATA55/ IPOINT, JVORT, NB C IPOINT is the index of point of evaluation in the incompressible steady rotational flow DO 15 IPOINT = 1, NPOINTS !max 1080 C.......my code 1 Kount = 0 DO 35 NB= 1, NB !max 8 DO 25 JVORT= 1, NF !max 62 Kount = Kount + 1 C.......my code 2 SSX= valueXX SSR= valueRR SST= valueTT C.......my code 3 ValX(Kount) = SSX*GAM(JVORT)*R(JVORT,1) ValAbsX(Kount)=ABS(ValX(Kount)) ValR(Kount) = SSR*GAM(JVORT)*R(JVORT,1) ValAbsR(Kount)=ABS(ValR(Kount)) ValT(Kount) = SST*GAM(JVORT)*R(JVORT,1) ValAbsT(Kount)=ABS(ValT(Kount)) 25 CONTINUE 35 CONTINUE C in order to minimize the roundoff errors, sort the absolute values in C ascending order, and then sum the actual values in the new order C call sort2d (kount, valAbsX, ValX) call sort2d (kount, valAbsR, ValR) call sort2d (kount, valAbsT, ValT) SUMX = 0.D0 SUMR = 0.D0 SUMT = 0.D0 DO 555 Index = 1, kount SUMX = SUMX + ValX(Index) SUMR = SUMR + ValR(Index) SUMT = SUMT + ValT(Index) 555 CONTINUE DX(IPOINT) = 0.5*SUMX DR(IPOINT) = 0.5*SUMR DT(IPOINT) = 0.5*SUMT 15 CONTINUE !outermost loop for ipoint Return END C ******************************************************************* SUBROUTINE SORT2D(N, RA, RB) C ******************************************************************* C double precision RA(N), RB(N) C.......my code 4 Return END

Mon, 22 Aug 2011 05:42:38 GMT 


Robert Koma #2 / 136

Possible F77 Code Improvement ??
Quote:
> Hello; > I would very much appreciate your help. > I've developed a complex F77 code (with limited F90) in a piecewise > fashion during the last ten years or so, and still working on it > whenever my time allows. > The package (compiled with g95) runs quite slow! It currently takes > hours on a PC 3.16 GHz m/c, but the program works perfectly (until I > find the last bug!) > The package contains 120  150 subroutines, one of them is Sub > EFFECT5, which is called 1000s of times by other routines in the > package. > An abbreviated version of Subroutine EFFECT5 is listed below. It has > 3 main loops: 15, 25, 35. > It's reasonable to assume that a code improvement of the sub, > regardless of how insignificant it is, would have an impact on > speeding up the entire package! > You experts would most certainly see many "obvious" ways for > improvement !! ... delete statements, combine expressions, delete > variables, replace the 3 loops, change declarations, etc. > Your suggestions would be greatly appreciated.
... Quote: > C in order to minimize the roundoff errors, sort the absolute values > in > C ascending order, and then sum the actual values in the new order > C > call sort2d (kount, valAbsX, ValX) > call sort2d (kount, valAbsR, ValR) > call sort2d (kount, valAbsT, ValT)
A quick look makes me think that this is where your code is spending most of its time. Try commenting these lines out and see how it affects the runtime (and the accuracy). Cheers, Rob Komar

Mon, 22 Aug 2011 06:20:00 GMT 


Dick Hendrickso #3 / 136

Possible F77 Code Improvement ??
Quote:
> Hello; > I would very much appreciate your help. > I've developed a complex F77 code (with limited F90) in a piecewise > fashion during the last ten years or so, and still working on it > whenever my time allows. > The package (compiled with g95) runs quite slow! It currently takes > hours on a PC 3.16 GHz m/c, but the program works perfectly (until I > find the last bug!) > The package contains 120  150 subroutines, one of them is Sub > EFFECT5, which is called 1000s of times by other routines in the > package. > An abbreviated version of Subroutine EFFECT5 is listed below. It has > 3 main loops: 15, 25, 35. > It's reasonable to assume that a code improvement of the sub, > regardless of how insignificant it is, would have an impact on > speeding up the entire package! > You experts would most certainly see many "obvious" ways for > improvement !! ... delete statements, combine expressions, delete > variables, replace the 3 loops, change declarations, etc. > Your suggestions would be greatly appreciated.
A couple of suggestions. Are you sure EFFECT5 is the big time consumer? Have you tried some sort of timing profiler? Also, have you tried various optimization levels? I know it sounds silly, but sometimes when people call tech support the answer is "plug in the computer." The only other things that jump out are the summation loop, 555. If you can do some F95 stuff, try using 3 SUM intrinsics, rather than the explicit summation loop. Some times the compiler can generate better inline code for a SUM. Other times, doing all 3 in one loop is more efficient; depends on cache hits, instruction scheduling, number of registers, and black magic. In real life, I'd also worry about the sort2d routine. This version is pretty efficient, but sorting is hard and when you actually do it you could be spending a lot of time in sort2d. I'm guessing you're sorting be the absolute value and carrying along the actual value. You might try eliminating the 3 ValAbs* arrays and doing the ABS computation in sort2d on the fly. It's likely to add tons of absolute value computations in sort2d. But, they are likely to be very fast and well scheduled. If so, you might eliminate cache collisions and also make the DO 35 loop more efficient just because it will use less registers. You could also just move the computations of ValAbs* out of the DO 35 loop and put in set of ValAbs* = ABS(Val*) after the loop. On the chance that the compiler will do a much more efficient job on the simple intrinsics. You could also try something called Kahan's method for doing a summation. I don't remember how to do it, but (I think) it's pretty good at summing unsorted values and controlling roundoff. That would also eliminate the need for the ValAbs* arrays and the sort2d calls. Dick Hendrickson Quote: > Thank you kindly. > Monir > C > ********************************************************************** > SUBROUTINE EFFECT5 > C > ********************************************************************** > EXTERNAL FUNC5R, MIDPNT, MIDINF > REAL X(62,8), R(62,8), TH(62,8) > REAL GAM(62), TANBIN(62) > double precision ValX(496), ValR(496), ValT(496) > double precision ValAbsX(496), ValAbsR(496), ValAbsT(496) > double precision SSX, SSR, SST > double precision SUMX, SUMR, SUMT > double precision DX(1080), DR(1080), DT(1080) > COMMON /DATA02/ DX, DR, DT > COMMON /DATA03/ X, R, TH > COMMON /DATA04/ GAM, TANBIN > COMMON /DATA33/ NPOINTS, XPOINT, RPOINT, THPOINT > COMMON /DATA55/ IPOINT, JVORT, NB > C IPOINT is the index of point of evaluation in the incompressible > steady rotational flow > DO 15 IPOINT = 1, NPOINTS !max 1080 > C.......my code 1 > Kount = 0 > DO 35 NB= 1, NB !max 8 > DO 25 JVORT= 1, NF !max 62 > Kount = Kount + 1 > C.......my code 2 > SSX= valueXX > SSR= valueRR > SST= valueTT > C.......my code 3 > ValX(Kount) = SSX*GAM(JVORT)*R(JVORT,1) > ValAbsX(Kount)=ABS(ValX(Kount)) > ValR(Kount) = SSR*GAM(JVORT)*R(JVORT,1) > ValAbsR(Kount)=ABS(ValR(Kount)) > ValT(Kount) = SST*GAM(JVORT)*R(JVORT,1) > ValAbsT(Kount)=ABS(ValT(Kount)) > 25 CONTINUE > 35 CONTINUE > C in order to minimize the roundoff errors, sort the absolute values > in > C ascending order, and then sum the actual values in the new order > C > call sort2d (kount, valAbsX, ValX) > call sort2d (kount, valAbsR, ValR) > call sort2d (kount, valAbsT, ValT) > SUMX = 0.D0 > SUMR = 0.D0 > SUMT = 0.D0 > DO 555 Index = 1, kount > SUMX = SUMX + ValX(Index) > SUMR = SUMR + ValR(Index) > SUMT = SUMT + ValT(Index) > 555 CONTINUE > DX(IPOINT) = 0.5*SUMX > DR(IPOINT) = 0.5*SUMR > DT(IPOINT) = 0.5*SUMT > 15 CONTINUE !outermost loop for ipoint > Return > END > C ******************************************************************* > SUBROUTINE SORT2D(N, RA, RB) > C ******************************************************************* > C > double precision RA(N), RB(N) > C.......my code 4 > Return > END

Mon, 22 Aug 2011 06:37:08 GMT 


Glen Herrmannsfeld #4 / 136

Possible F77 Code Improvement ??
(snip) Quote: > in order to minimize the roundoff errors, sort
> the absolute values in ascending order, and then > sum the actual values in the new order I know that with one sign it is best to sort and add. I am not so sure about mixed signs. I can easily imagine that a large value of the opposite sign at the end wipes out all the previously minimized roundoff errors.  glen

Mon, 22 Aug 2011 06:46:37 GMT 


Tim Princ #5 / 136

Possible F77 Code Improvement ??
Quote:
> DO 15 IPOINT = 1, NPOINTS !max 1080 > C.......my code 1 > Kount = 0 > DO 35 NB= 1, NB !max 8 > DO 25 JVORT= 1, NF !max 62 > Kount = Kount + 1 > C.......my code 2 > SSX= valueXX > SSR= valueRR > SST= valueTT > C.......my code 3 > ValX(Kount) = SSX*GAM(JVORT)*R(JVORT,1) > ValAbsX(Kount)=ABS(ValX(Kount)) > ValR(Kount) = SSR*GAM(JVORT)*R(JVORT,1) > ValAbsR(Kount)=ABS(ValR(Kount)) > ValT(Kount) = SST*GAM(JVORT)*R(JVORT,1) > ValAbsT(Kount)=ABS(ValT(Kount)) > 25 CONTINUE
This appears to be vectorizable, for most common CPUs. If you use a compiler which isn't capable of vectorization with multiple sum reductions in one loop, it may be worth splitting it to gain vectorization. Quote: > 35 CONTINUE > C in order to minimize the roundoff errors, sort the absolute values > in > C ascending order, and then sum the actual values in the new order > C > call sort2d (kount, valAbsX, ValX) > call sort2d (kount, valAbsR, ValR) > call sort2d (kount, valAbsT, ValT)
It would likely be faster to sum in x87 extra precision instead of sorting, even though that prevents effective vectorization. Quote: > SUMX = 0.D0 > SUMR = 0.D0 > SUMT = 0.D0 > DO 555 Index = 1, kount > SUMX = SUMX + ValX(Index) > SUMR = SUMR + ValR(Index) > SUMT = SUMT + ValT(Index) > 555 CONTINUE > DX(IPOINT) = 0.5*SUMX > DR(IPOINT) = 0.5*SUMR > DT(IPOINT) = 0.5*SUMT > 15 CONTINUE !outermost loop for ipoint

Mon, 22 Aug 2011 10:52:21 GMT 


moni #6 / 136

Possible F77 Code Improvement ??
Quote:
> > Hello; > > I would very much appreciate your help. > > I've developed a complex F77 code (with limited F90) ?in a piecewise > > fashion during the last ten years or so, and still working on it > > whenever my time allows. > > The package (compiled with g95) runs quite slow! ?It currently takes > > hours on a PC 3.16 GHz m/c, but the program works perfectly (until I > > find the last bug!) > > The package contains 120  150 subroutines, one of them is Sub > > EFFECT5, which is called 1000s of times by other routines in the > > package. > > An abbreviated version of Subroutine EFFECT5 is listed below. ?It has > > 3 main loops: 15, 25, 35. > > It's reasonable to assume that a code improvement of the sub, > > regardless of how insignificant it is, would have an impact on > > speeding up the entire package! > > You experts would most certainly see many "obvious" ways for > > improvement !! ... delete statements, combine expressions, delete > > variables, replace the 3 loops, change declarations, etc. > > Your suggestions would be greatly appreciated. > ... > > C in order to minimize the roundoff errors, sort the absolute values > > in > > C ascending order, and then sum the actual values in the new order > > C > > call sort2d (kount, valAbsX, ValX) > > call sort2d (kount, valAbsR, ValR) > > call sort2d (kount, valAbsT, ValT) > A quick look makes me think that this is where your code is spending > most of its time. ?Try commenting these lines out and see how it > affects the runtime (and the accuracy). > Cheers, > Rob Komar Hide quoted text  >  Show quoted text 
Rob; Yes. Commenting out Call sort2d lines positively affects the runtime and adversely affects the accuracy. Regards. Monir

Mon, 22 Aug 2011 12:37:16 GMT 


moni #7 / 136

Possible F77 Code Improvement ??
Quote:
> > Hello; > > I would very much appreciate your help. > > I've developed a complex F77 code (with limited F90) ?in a piecewise > > fashion during the last ten years or so, and still working on it > > whenever my time allows. > > The package (compiled with g95) runs quite slow! ?It currently takes > > hours on a PC 3.16 GHz m/c, but the program works perfectly (until I > > find the last bug!) > > The package contains 120  150 subroutines, one of them is Sub > > EFFECT5, which is called 1000s of times by other routines in the > > package. > > An abbreviated version of Subroutine EFFECT5 is listed below. ?It has > > 3 main loops: 15, 25, 35. > > It's reasonable to assume that a code improvement of the sub, > > regardless of how insignificant it is, would have an impact on > > speeding up the entire package! > > You experts would most certainly see many "obvious" ways for > > improvement !! ... delete statements, combine expressions, delete > > variables, replace the 3 loops, change declarations, etc. > > Your suggestions would be greatly appreciated. > A couple of suggestions. ?Are you sure EFFECT5 is the big > time consumer? ?Have you tried some sort of timing profiler? > Also, have you tried various optimization levels? ?I know it > sounds silly, but sometimes when people call tech support > the answer is "plug in the computer." > The only other things that jump out are the summation loop, 555. > If you can do some F95 stuff, try using 3 SUM intrinsics, rather > than the explicit summation loop. ?Some times the compiler can > generate better inline code for a SUM. ?Other times, doing all > 3 in one loop is more efficient; depends on cache hits, instruction > scheduling, number of registers, and black magic. > In real life, I'd also worry about the sort2d routine. ?This > version is pretty efficient, but sorting is hard and when you > actually do it you could be spending a lot of time in sort2d. > I'm guessing you're sorting be the absolute value and > carrying along the actual value. ?You might try eliminating the > 3 ValAbs* arrays and doing the ABS computation in sort2d > on the fly. ?It's likely to add tons of absolute value > computations in sort2d. ?But, they are likely to be very > fast and well scheduled. ?If so, you might eliminate cache > collisions and also make the DO 35 loop more efficient > just because it will use less registers. ?You could also > just move the computations of ValAbs* out of the DO 35 > loop and put in set of > ? ? ? ?ValAbs* = ABS(Val*) > after the loop. ?On the chance that the compiler will do > a much more efficient job on the simple intrinsics. > You could also try something called Kahan's method for > doing a summation. ?I don't remember how to do it, but > (I think) it's pretty good at summing unsorted values > and controlling roundoff. ?That would also eliminate > the need for the ValAbs* arrays and the sort2d calls. >{*filter*} Hendrickson > > Thank you kindly. > > Monir > > C > > ********************************************************************** > > ? ?SUBROUTINE EFFECT5 > > C > > ********************************************************************** > > ?EXTERNAL FUNC5R, MIDPNT, MIDINF > > ?REAL ? ?X(62,8), R(62,8), TH(62,8) > > ?REAL ? ?GAM(62), TANBIN(62) > > ?double precision ValX(496), ValR(496), ValT(496) > > ?double precision ValAbsX(496), ValAbsR(496), ValAbsT(496) > > ?double precision SSX, SSR, SST > > ?double precision SUMX, SUMR, SUMT > > ?double precision DX(1080), DR(1080), DT(1080) > > ?COMMON /DATA02/ DX, DR, DT > > ?COMMON /DATA03/ X, R, TH > > ?COMMON /DATA04/ GAM, TANBIN > > ?COMMON /DATA33/ NPOINTS, XPOINT, RPOINT, THPOINT > > ?COMMON /DATA55/ IPOINT, JVORT, NB > > C IPOINT is the index of point of evaluation in the incompressible > > steady rotational flow > > ?DO 15 IPOINT = 1, NPOINTS ?!max 1080 > > C.......my code 1 > > ? ? ?Kount = 0 > > ? ? ?DO 35 NB= 1, NB !max 8 > > ? ? ? ? ? ?DO 25 JVORT= 1, NF ?!max 62 > > ? ? ? ? ? ? ? ? ?Kount = Kount + 1 > > C.......my code 2 > > ? ? ? ? ? ? ? ? SSX= valueXX > > ? ? ? ? ? ? ? ? SSR= valueRR > > ? ? ? ? ? ? ? ? SST= valueTT > > C.......my code 3 > > ? ? ? ? ? ?ValX(Kount) = SSX*GAM(JVORT)*R(JVORT,1) > > ? ? ? ? ? ? ? ? ? ValAbsX(Kount)=ABS(ValX(Kount)) > > ? ? ? ? ? ?ValR(Kount) = SSR*GAM(JVORT)*R(JVORT,1) > > ? ? ? ? ? ? ? ? ? ValAbsR(Kount)=ABS(ValR(Kount)) > > ? ? ? ? ? ?ValT(Kount) = SST*GAM(JVORT)*R(JVORT,1) > > ? ? ? ? ? ? ? ? ? ValAbsT(Kount)=ABS(ValT(Kount)) > > 25 ? ? ? ?CONTINUE > > 35 ? ? CONTINUE > > C in order to minimize the roundoff errors, sort the absolute values > > in > > C ascending order, and then sum the actual values in the new order > > C > > ?call sort2d (kount, valAbsX, ValX) > > ?call sort2d (kount, valAbsR, ValR) > > ?call sort2d (kount, valAbsT, ValT) > > ? ? ? SUMX = 0.D0 > > ? ? ? SUMR = 0.D0 > > ? ? ? SUMT = 0.D0 > > ? ? ? DO 555 Index = 1, kount > > ? ? ? ? ? ? ? ? SUMX = SUMX + ValX(Index) > > ? ? ? ? ? ? ? ? SUMR = SUMR + ValR(Index) > > ? ? ? ? ? ? ? ? SUMT = SUMT + ValT(Index) > > 555 ? ? ? ? ?CONTINUE > > ? ? ? ? DX(IPOINT) = 0.5*SUMX > > ? ? ? ? DR(IPOINT) = 0.5*SUMR > > ? ? ? ? DT(IPOINT) = 0.5*SUMT > > 15 ?CONTINUE ? ? !outermost loop for ipoint > > ?Return > > ?END > > C ******************************************************************* > > ? ?SUBROUTINE SORT2D(N, RA, RB) > > C ******************************************************************* > > C > > ?double precision RA(N), RB(N) > > C.......my code 4 > > ?Return > > ?END Hide quoted text  >  Show quoted text  Hide quoted text  >  Show quoted text 
Dick; Thank you for your prompt reply and helpful suggestions. I'm sure EFFECT5 is the culprit. No doubt its subsidiary routine sort2d and the summation loop 555 consume a lot of time. The values involved in each sum vary from extremely large positive/ negative values to extremely small positive/negative values. In fact, some of the values are "almost" equal with opposite sign, which can result in serious loss of precision. To achieve a reasonable level of precision following a series of numerical experimentation, I decided a while back to sort the values before the summation, and hence the routine sort2d and loop 555. The results correlate extremely well with other theories. I've used F90 sporadically, but never used F95, so I'm not familiar, for example, with the use of "fortran SUM Intrinsic Function", even though I use the Intrinsic SUM in other programming languages (VBA). I reckon for a such use one would have to take extra care in declaring the exact arrays size, modify the common data blocks, adjust the equivalence statements (nightmare!), etc. to comply with the F95 instructions. You suggested the possibility of "... doing all 3 in one loop is more efficient ..." Would you be kind enough to elaborate on how ?? Maybe just couple of code lines to get the idea. I don't remember the "Kahan's method for summation". Could it be that you meant "Knuth" method ?? In any event, I suppose the method would require a routine to replace sort2d. Correct ?? Will check. Thanks again for your time and help. Monir

Mon, 22 Aug 2011 12:38:43 GMT 


moni #8 / 136

Possible F77 Code Improvement ??
Quote:
> (snip) > > in order to minimize the roundoff errors, sort > ?> the absolute values in ascending order, and then > ?> sum the actual values in the new order > I know that with one sign it is best to sort and add. > I am not so sure about mixed signs. ?I can easily imagine > that a large value of the opposite sign at the end wipes > out all the previously minimized roundoff errors. >  glen
Glen; I believe it is the subtraction of almost equal small values that can seriously affect the precision of the calculations. Routine sort2d is based on N.R. Heapsort method and has been tested extensively in the summation of very large arrays of mixed elements ranging from extremely large to extremely small values. Regards. Monir

Mon, 22 Aug 2011 12:40:19 GMT 


Glen Herrmannsfeld #9 / 136

Possible F77 Code Improvement ??
(snip) Quote: > I don't remember the "Kahan's method for summation". t > Could it be tha you meant "Knuth" method ??
http://en.wikipedia.org/wiki/Kahan_summation_algorithm#The_algorithm Quote: > In any event, I suppose the method would require a > routine to replace sort2d. Correct ?? Will check.
http://en.wikipedia.org/wiki/Kahan_summation_algorithm#Alternatives I think the way the suggested sorted sum method works for values of both signs would be something like: Sort the positive numbers. Sort the negative numbers. Start with the one with the smallest absolute value, and add it to the sum. Continue adding numbers from the positive list or negative list, whichever keeps the absolute value of the sum the smallest. If there aren't any more in one of the lists, just add the rest from the other list. It isn't so obvious to me that sorting by absolute value does that, but then it isn't obvious that it doesn't work, either.  glen

Mon, 22 Aug 2011 13:24:26 GMT 


moni #10 / 136

Possible F77 Code Improvement ??
Quote:
> (snip) > > I don't remember the "Kahan's method for summation". t > > Could it be tha you meant "Knuth" method ?? > http://en.wikipedia.org/wiki/Kahan_summation_algorithm#The_algorithm > > In any event, I suppose the method would require a > > routine to replace sort2d. ?Correct ?? ?Will check. > http://en.wikipedia.org/wiki/Kahan_summation_algorithm#Alternatives > I think the way the suggested sorted sum method works for values > of both signs would be something like: > Sort the positive numbers. > Sort the negative numbers. > Start with the one with the smallest absolute value, and add > it to the sum. ?Continue adding numbers from the positive list > or negative list, whichever keeps the absolute value of the > sum the smallest. > If there aren't any more in one of the lists, just add the > rest from the other list. > It isn't so obvious to me that sorting by absolute value > does that, but then it isn't obvious that it doesn't work, > either. >  glen
Thanks Glen. Will check both sites. Regards. Monir

Mon, 22 Aug 2011 13:39:08 GMT 


moni #11 / 136

Possible F77 Code Improvement ??
Quote:
> > ?DO 15 IPOINT = 1, NPOINTS ?!max 1080 > > C.......my code 1 > > ? ? ?Kount = 0 > > ? ? ?DO 35 NB= 1, NB !max 8 > > ? ? ? ? ? ?DO 25 JVORT= 1, NF ?!max 62 > > ? ? ? ? ? ? ? ? ?Kount = Kount + 1 > > C.......my code 2 > > ? ? ? ? ? ? ? ? SSX= valueXX > > ? ? ? ? ? ? ? ? SSR= valueRR > > ? ? ? ? ? ? ? ? SST= valueTT > > C.......my code 3 > > ? ? ? ? ? ?ValX(Kount) = SSX*GAM(JVORT)*R(JVORT,1) > > ? ? ? ? ? ? ? ? ? ValAbsX(Kount)=ABS(ValX(Kount)) > > ? ? ? ? ? ?ValR(Kount) = SSR*GAM(JVORT)*R(JVORT,1) > > ? ? ? ? ? ? ? ? ? ValAbsR(Kount)=ABS(ValR(Kount)) > > ? ? ? ? ? ?ValT(Kount) = SST*GAM(JVORT)*R(JVORT,1) > > ? ? ? ? ? ? ? ? ? ValAbsT(Kount)=ABS(ValT(Kount)) > > 25 ? ? ? ?CONTINUE > This appears to be vectorizable, for most common CPUs. ? If you use a > compiler which isn't capable of vectorization with multiple sum reductions > in one loop, it may be worth splitting it to gain vectorization. > > 35 ? ? CONTINUE > > C in order to minimize the roundoff errors, sort the absolute values > > in > > C ascending order, and then sum the actual values in the new order > > C > > ?call sort2d (kount, valAbsX, ValX) > > ?call sort2d (kount, valAbsR, ValR) > > ?call sort2d (kount, valAbsT, ValT) > It would likely be faster to sum in x87 extra precision instead of > sorting, even though that prevents effective vectorization. > > ? ? ? SUMX = 0.D0 > > ? ? ? SUMR = 0.D0 > > ? ? ? SUMT = 0.D0 > > ? ? ? DO 555 Index = 1, kount > > ? ? ? ? ? ? ? ? SUMX = SUMX + ValX(Index) > > ? ? ? ? ? ? ? ? SUMR = SUMR + ValR(Index) > > ? ? ? ? ? ? ? ? SUMT = SUMT + ValT(Index) > > 555 ? ? ? ? ?CONTINUE > > ? ? ? ? DX(IPOINT) = 0.5*SUMX > > ? ? ? ? DR(IPOINT) = 0.5*SUMR > > ? ? ? ? DT(IPOINT) = 0.5*SUMT > > 15 ?CONTINUE ? ? !outermost loop for ipoint Hide quoted text  >  Show quoted text  Hide quoted text  >  Show quoted text 
Tim; Quote: >"It would likely be faster to sum in x87 extra precision instead of sorting"
Do you mean using higher than double precision ?? Can I do that with F77/g95 compiler ?? Regards. Monir

Mon, 22 Aug 2011 13:48:23 GMT 


Glen Herrmannsfeld #12 / 136

Possible F77 Code Improvement ??
(someone wrote) Quote: >>"It would likely be faster to sum in x87 extra precision instead of sorting" > Do you mean using higher than double precision ?? > Can I do that with F77/g95 compiler ??
Sometimes it happens naturally, whether you want it or not. Sometimes that is good, other times not.  glen

Mon, 22 Aug 2011 14:05:17 GMT 


Dave Allure #13 / 136

Possible F77 Code Improvement ??
What is this? Quote: > COMMON /DATA55/ IPOINT, JVORT, NB > ... > DO 35 NB= 1, NB !max 8
Using the upper limit of a DO loop for the loop index of the same loop? If this is in your real code, then malfunction such as absurdly long execution time would be expected. Dave

Mon, 22 Aug 2011 15:36:49 GMT 


Terenc #14 / 136

Possible F77 Code Improvement ??
I think the sort is what is hurting your time most. Anothe may vbe using extra precision calculation in some place where you don't need it I'ma lso assuming the compiler will optimise some of your code that has repeated terms. . I know why you do this sorting, but here's my main points. 1) If the results of the summations you are performing are expected to be LARGE, it won't have much effect but if the result should be SMALL, then DO NOT separate the positives and negative values before summing because in this particular case you could LOSE precision unless you peformed the summation with still double further precision!. It's the old rule of not subtracting two large numbers 2) If you do a sort, do it the most efficient way possible.

Mon, 22 Aug 2011 16:36:51 GMT 


JB #15 / 136

Possible F77 Code Improvement ??
Quote: > I'm sure EFFECT5 is the culprit.
You're "sure", as in "I've got profiling data", or "just because" (which tends to be worthless)? Quote: > No doubt
"No doubt"? I'm beginning to suspect you actually haven't profiled your program, which is the first thing to do in any optimization exercise. As an aside, modern sampling profilers can profile optimized code with very low overhead. Quote: > its subsidiary routine > sort2d and the summation loop 555 consume a lot of time.
You previously mentioned that sort2d uses heapsort, which is O(N*log(N)), whereas the summation is O(N), so obviously for large N the sorting time will dominate. By using the Kahan summation algorithm, you might be able to get rid of the sorting without suffering in accuracy. Or a cheaper way might be to declare the sum accumulator as extended precision. Quote: > I've used F90 sporadically, but never used F95, so I'm not familiar, > for example, with the use of "Fortran SUM Intrinsic Function", even > though I use the Intrinsic SUM in other programming languages (VBA). > I reckon for a such use one would have to take extra care in declaring > the exact arrays size, modify the common data blocks, adjust the > equivalence statements (nightmare!), etc. to comply with the F95 > instructions.
Maybe, if your program uses all kinds of nonstandard tricks. For standard conforming code, the F95 SUM intrinsic should work out of the box. Quote: > You suggested the possibility of "... doing all 3 in one loop is more > efficient ..." > Would you be kind enough to elaborate on how ??
Looping over one array at a time might give better cache behaviour. Or then it that won't matter, and your fused loops are better due to providing the CPU with more opportunities for parallelism. As previously mentioned, profiling is the tool you can use to determine which is better in your case.  JB

Mon, 22 Aug 2011 17:26:32 GMT 


