floating point error: Error with floating point division

enjoyamalp · November 28, 2007, 10:45am

for( int i = 0; i < 6000; i++ )
{
   if( weight[i] > 0 )
   {
     device_findweighteraverage[i*3] /= weight[i];
   }
}

The above code gives slightly different results when run on the GPU and on the CPU. Emulation mode gives the same output as the CPU. What might be the reason? Is there any difference in the accuracy of floating point operations between the CPU and the GPU? I am using an 8800GTS card with CUDA 1.0. The error occurs at the 6th decimal place. I can't afford this error because I am doing a lot of calculation based on these values; the error accumulates during every phase and goes outside my error range. Is there any way to prevent this?

I am running the program with one thread and one block. So there is no threading or synchronization issue.

value CPU: 0.571534;-1.329878

value GPU: 0.571535;-1.329879

The difference does not appear at every location, only at some of them.

AndreiB · November 28, 2007, 11:36am

This is expected behaviour.

CPU and GPU handle float values differently. The CPU uses an internal 80-bit (‘extended’) representation, while the GPU uses a 32-bit (single precision) representation. This is why you get different results on the GPU and the CPU.

MisterAnderson42 · November 28, 2007, 1:16pm

The 6th decimal place is all the accuracy you can expect from single precision floating point. Even on the CPU, just summing values in a different order can cause fluctuations in the 6th decimal place.
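A tiny host-side illustration of the ordering effect, as a sketch only (the helper name `sum_in_order` is hypothetical):

```cpp
#include <cstddef>

// Plain left-to-right single-precision sum; the result depends on element order.
float sum_in_order(const float* data, std::size_t n) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i) acc += data[i];
    return acc;
}
```

Summing {1.0e8f, 1.0f, -1.0e8f} left to right yields 0, because 1.0f is smaller than the spacing between adjacent floats near 1.0e8f and is absorbed. Summing the same three values as {1.0e8f, -1.0e8f, 1.0f} yields 1, because the large terms cancel first.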

wildcat4096 · November 28, 2007, 3:04pm

You might be able to approximate the GPU precision on the CPU by setting the appropriate floating point unit precision control word. _controlfp(…) in Visual Studio does this, and there are similar functions on Linux.

Mark_Harris · November 29, 2007, 11:46am

You may also be able to get better accuracy (on both the CPU and GPU!) by using a parallel reduction (aka pair-wise summation) instead. There are actually quite a lot of papers published just on accurate floating-point summation.

A good 2-page short paper from 1970 is:

http://portal.acm.org/citation.cfm?id=362498&dl=GUIDE&dl=ACM

You might want to look at the “reduction” sample in the CUDA 1.1 SDK. We have also found that by increasing the number of leaves in the summation tree (which translates to the amount of parallelism) you can also increase accuracy. There is an accuracy / performance trade off beyond a certain point, though.

There is some discussion about this in our updated Monte Carlo white paper in the CUDA 1.1 SDK, which I’ll excerpt here:

Accurate Summation

Floating point summation is an extremely important and common computation for a wide variety of numerical applications. As a result, there is a large body of literature on the analysis of accuracy of many summation algorithms [6, 7]. The most common sequential approach, often called recursive summation, in which values are added sequentially, can lead to a large amount of round-off error. Intuitively, as the magnitude of the sum gets very large relative to the summands, the amount of round-off error increases. This can lead to catastrophic errors. By reordering the summation (i.e. sorting in order of increasing magnitude) error can be reduced, but this doesn’t help if all of the input values have similar values (which may be the case in Monte Carlo option pricing).

Instead of adding all the values into a single sum, we can maintain multiple partial sums. If we add the same number of values into each partial sum, and the input values are similar in magnitude, the partial sums will likewise all be similar magnitude, so that when they are added together, the round-off error will be reduced. If we extend this idea, we get pair-wise summation [6], which results in a summation tree just like the one we use in our parallel reduction. Thus, not only is parallel reduction efficient on GPUs, but it can improve accuracy!

In practice, we found that by increasing the number of leaf nodes in our parallel reduction, we can significantly improve the accuracy of summation (as measured by the L1-norm of the error when comparing our GPU Monte Carlo against a double-precision CPU Monte Carlo implementation). Specifically, we reduced the L1-norm error from 7.6e-7 to 6e-8 by increasing the size of the shared memory reduction array s_SumCall from 128 to 1024 elements. The “MonteCarlo” SDK sample does not do this by default because the accuracy improvement is small compared to the error when comparing to Black-Scholes results, and because the additional accuracy comes at a performance cost of about 5%. This additional accuracy may, however, be important in real-world applications, so we provide it as an option in the code. The size of the reduction array in the code can be modified using the SUM_N parameter to sumReduce().

[6] Linz, Peter. “Accurate Floating-Point Summation”. Communications of the ACM, 13 (1970), pp. 361-362.

[7] Higham, Nicholas J. “The accuracy of floating point summation”. SIAM Journal on Scientific Computing, Vol. 14, No. 4 (1993), pp. 783-799.

Mark

Mark_Harris · November 29, 2007, 9:14pm

Hmmm, I misread the original code – someone pointed out that it’s floating point division, not summation. :)

Currently our division can have up to 2 ulps of error, which probably explains your accumulated error.

Hopefully the info about summation was useful anyway. :) Just curious – what kind of algorithm is this where you are doing repeated division?

Mark

enjoyamalp · November 30, 2007, 5:08am

It is something like finding an average once and comparing it with some other data. Then the data is magnified a bit more, and the average is found again and compared with the other data for the best match. Anyway, thanks for the help. For now I have moved just the division part to the CPU. It doesn't cost me much performance. But I wonder when I will be able to program the GPU exactly the same as the CPU :).

Mark_Harris · November 30, 2007, 10:04am

Average would be a sum, followed by a multiplication by the reciprocal of the number. (no division necessary) Your computation is not an average – what is it?
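As a sketch of that formulation (the function name is hypothetical, and returning 0 for an empty input is just a convention chosen here):

```cpp
#include <cstddef>

// Mean as a sum followed by one multiplication by the reciprocal of the count.
float mean_via_reciprocal(const float* data, std::size_t n) {
    if (n == 0) return 0.0f;               // convention chosen for this sketch
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) sum += data[i];
    const float inv_n = 1.0f / static_cast<float>(n);   // computed once
    return sum * inv_n;
}
```

Note that `sum * inv_n` involves two roundings, so it is not guaranteed to be bit-identical to `sum / n`. The benefit is that the reciprocal is computed once and can be reused for every value that shares the same count.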

Mark

enjoyamalp · November 30, 2007, 3:15pm

Hi Mark,

The code I posted is just a sample. My actual averaging is somewhat different. I am building a histogram and dividing the location from which each value is taken by the number of locations from which the value is selected. So it is of course not a pure average. It is just one part of a long processing algorithm.

Thanks a lot,

Amal P

Mark_Harris · November 30, 2007, 5:04pm

OK, that would require only one division per histogram bin. Your example was 6000 divisions of the same variable. The latter would have much higher error, and it doesn’t sound like the latter even comes up in real code.

If you are careful to maximize the accuracy of your summation, and minimize the number of divisions, you can maximize your accuracy.
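A minimal host-side sketch of "one division per bin" might look like this. The data layout (a bin index and a value per sample) and all names are assumptions for illustration, since the real algorithm was not posted.

```cpp
#include <cstddef>
#include <vector>

// Accumulate per-bin sums and counts first, then divide exactly once per bin.
// Samples whose bin index is out of range are skipped.
std::vector<float> bin_averages(const std::vector<std::size_t>& bin_of_sample,
                                const std::vector<float>& value,
                                std::size_t num_bins) {
    std::vector<float> sum(num_bins, 0.0f);
    std::vector<unsigned> count(num_bins, 0u);

    for (std::size_t i = 0; i < value.size() && i < bin_of_sample.size(); ++i) {
        const std::size_t b = bin_of_sample[i];
        if (b >= num_bins) continue;
        sum[b] += value[i];
        ++count[b];
    }

    for (std::size_t b = 0; b < num_bins; ++b) {
        if (count[b] > 0) sum[b] /= static_cast<float>(count[b]);  // single division
    }
    return sum;   // empty bins stay at 0
}
```

The accumulation loop is where most of the round-off comes from, so it could be swapped for a pair-wise or partial-sum scheme if the bins hold many samples.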

Mark
