Floating Point arithmetic problem
Author Message
Floating Point arithmetic problem

Quote:

>This addition  usually results in  a  few insignificant digits getting
>accumulated in my double variable. For eg.

>  f = 12.99  is represented as 12.9900000000000002131628207
>  f = 11.111 is represented as 11.1110000000000006536993169

>and so on. I am interested in comparisons with 4 digits precision.

>Now coming  to   the  problem. The  insignificant  digits   due to the
>floating  point representation keep  accruing and  there comes a stage
>when the accrued value exceeds 0.0001 which  results in failure of the
>if condition in the  above block of code,  when ideally no such  thing
>should have occurred.

You are running into a fundamental problem with floating point
numbers.  Assume that your values have a magnitude of 1e4 (which you
say is the limit).  Assume further that 8-byte IEEE-754 arithmetic is
used.  With approximately 17 digits of precision, the lsb is
approximately 1e-13.  If you add 1e6 numbers and the contributed
rounding error from each addition is 1/2 lsb (a hand-waving, but
probably reasonable sort of assumption), the accumulated error would
be on the order of 5e-8.  That is better than your tolerance of 1e-4.

The retained precision also depends on the order in which the numbers
are added.  The above analysis assumes that the accumulated value
stays less than 1e4.  If, instead, you added 5e5 positive values of
approximately 1e4, THEN 5e5 negative values of opposite value which
result in a theoretical sum near zero, the accumulated value will
temporarily go to approximately 5e9, where the typical 1/2 lsb error
would then be about 5e-9.  Multiply that by 1e6 and get 5e-3, which is
greater than you tolerance.

Quote:
>All I need is a solution that will  overcome this problem. Please
bear
>in mind  that the  loop is executed  millions  of times and  hence any
>costly operation     within  the loop    with  drastically  bring down
>performance.

Avoid accumulating large values which will eventually cancel.  See if
your implementation supports a long double type with greater
precision than double.

Quote:
>I have a few options to overcome this problem :-

>* After each   addition,   covert  'd' to   an  unsigned   long  after
>  multiplying  by say 1e8, (thus   truncating the unnecessary digits),
>  and divide it by 1e8 to get back the original value.

That would, for arbitrary operands, make matters much worse by adding
might be multiples of .001 or .0001.  If that is the case, you would
be much better off scaling your values by a multiplier that results in
all values being integers, i.e., multiplying by 10000, before summing
them.  This will reduce the noise produced when adding the numbers,
probably enough to give you a perfect solution.  Obviously you need to
then divide the final sum by the scaling value to get your intended
sum.

Mon, 03 Aug 1998 03:00:00 GMT
Floating Point arithmetic problem

Quote:

>That would, for arbitrary operands, make matters much worse by adding
>might be multiples of .001 or .0001.  If that is the case, you would
>be much better off scaling your values by a multiplier that results in
>all values being integers, i.e., multiplying by 10000, before summing
>them.  This will reduce the noise produced when adding the numbers,
>probably enough to give you a perfect solution.  Obviously you need to

Yes. If there are enough significant bit in the mantissa of whatever
floating-point representation is being used, then all the integers can be
represented. E.g. the double-precision IEEE format has 52 bits, which is plenty
for storing perfect representations of 32-bit integers. The solution I
presented which converts an integer loop variable into floating point may be
overkill if every integer can be accurately represented.
--

Tue, 04 Aug 1998 03:00:00 GMT

 Page 1 of 1 [ 2 post ]

Relevant Pages