for( int i = 0; i < 6000; i++ )
{
    if( weight[i] > 0 )
    {
        device_findweighteraverage[i*3] /= weight[i];
    }
}

The above program gives slightly different results when run on the GPU and on the CPU. Emulation mode gives the same output as the CPU. What might be the reason? Is there any difference in floating-point accuracy between the CPU and the GPU? I am using an 8800 GTS card with CUDA 1.0. The error occurs at the 6th decimal place. I can't afford this error because I am doing a lot of calculation based on these values; the error accumulates in every phase and ends up outside my error range. Is there any way to prevent this?

I am running the program with one thread and one block. So there is no threading or synchronization issue.

value CPU: 0.571534;-1.329878

value GPU: 0.571535;-1.329879

The difference does not appear at every location, only at some of them.

The CPU and the GPU handle float values differently. The CPU uses an internal 80-bit (‘extended’) representation and the GPU uses a 32-bit (single precision) representation. This is why you get different results on the GPU and the CPU.

The 6th decimal place is about all the accuracy you can expect from single precision floating point, which carries roughly 7 significant decimal digits. Even on the CPU, just summing values in a different order can cause fluctuations in the 6th decimal place.
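To make the point about summation order concrete, here is a small sketch in C. The function name sum_in_order is made up for this illustration, and the input values suggested below are deliberately extreme so the effect is easy to see; real data would typically differ only in the last digit or two. The volatile accumulator is an assumption on my part to force every partial sum to be rounded to 32 bits, so that a compiler keeping intermediates in wider registers does not hide the effect.

```c
#include <assert.h>
#include <stddef.h>

/* Sum n floats strictly left to right, rounding to single precision
 * after every addition. (Hypothetical helper, for illustration only.) */
float sum_in_order(const float *x, size_t n)
{
    assert(x != NULL || n == 0);

    /* volatile forces each partial sum to be stored as a 32-bit float,
     * so no extended-precision value is carried between iterations. */
    volatile float acc = 0.0f;

    for (size_t i = 0; i < n; i++) {
        acc = acc + x[i];
    }
    return acc;
}
```

Summing {1e8f, 1.0f, -1e8f} this way returns 0, because 1e8f + 1.0f rounds back to 1e8f in single precision, while summing the same three values as {1e8f, -1e8f, 1.0f} returns 1.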

You might be able to approximate the GPU precision on the CPU by setting the appropriate floating point unit precision control word. _controlfp(…) in Visual Studio does this, and there are similar functions on Linux.

You may also be able to get better accuracy (on both the CPU and GPU!) by using a parallel reduction (aka pair-wise summation) instead. There are actually quite a lot of papers published just on accurate floating-point summation.
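As a rough idea of what pair-wise summation looks like, here is a minimal recursive sketch in plain C (serial, not the SDK's parallel kernel). The name pairwise_sum is made up for this post, and the volatile temporaries are only there to make sure each partial sum is rounded to 32 bits.

```c
#include <assert.h>
#include <stddef.h>

/* Pairwise (tree) summation: split the array in half, sum each half
 * recursively, then add the two partial sums.
 * (Hypothetical helper, not taken from the CUDA SDK.) */
float pairwise_sum(const float *x, size_t n)
{
    assert(x != NULL || n == 0);

    if (n == 0) {
        return 0.0f;
    }
    if (n == 1) {
        return x[0];
    }

    size_t half = n / 2;

    /* volatile: round each partial sum to single precision. */
    volatile float left  = pairwise_sum(x, half);
    volatile float right = pairwise_sum(x + half, n - half);

    return left + right;
}
```

For the input {16777216.0f, 1.0f, 1.0f, 1.0f} a plain left-to-right float sum stays stuck at 16777216, because each added 1 is lost to rounding. The pairwise version returns 16777218, which is closer to the exact sum of 16777219.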

You might want to look at the “reduction” sample in the CUDA 1.1 SDK. We have also found that increasing the number of leaves in the summation tree (which translates to the amount of parallelism) can increase accuracy. There is an accuracy / performance trade-off beyond a certain point, though.

There is some discussion about this in our updated Monte Carlo white paper in the CUDA 1.1 SDK, which I’ll excerpt here:

It is something like finding the average once and comparing it with some other data. Then the data is magnified a bit more, the average is found again, and it is compared with the other data for the best match. Anyway, thanks for the help. For now I have moved just the division part to the CPU. It doesn't cost me much in performance. But I wonder when I will be able to program the GPU exactly the same as the CPU :).

An average would be a sum, followed by a multiplication by the reciprocal of the number of values (no division necessary). Your computation is not an average – what is it?
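In code, that kind of average might look something like the sketch below. The name mean_via_reciprocal is invented for this example. Note that forming the reciprocal still takes one division, and that sum * (1/n) is not guaranteed to match sum / n in the last bit, so this is about reducing the number of divisions, not about bit-exact agreement.

```c
#include <assert.h>
#include <stddef.h>

/* Average of n floats: accumulate a sum, then multiply once by 1/n.
 * (Hypothetical helper, for illustration only.) */
float mean_via_reciprocal(const float *x, size_t n)
{
    assert(x != NULL && n > 0);

    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        sum += x[i];
    }

    /* The only division in the whole computation. */
    float inv_n = 1.0f / (float)n;

    return sum * inv_n;
}
```

For example, calling it on {1.0f, 2.0f, 3.0f, 4.0f} gives 2.5, since the sum 10 and the reciprocal 0.25 are both exact in single precision.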

The code I have written is just a sample. My actual averaging is somewhat different. I am building a histogram and dividing the location from which each value is taken by the number of locations from which the value is selected. So it is of course not a pure average. It is just one part of a long processing algorithm.

OK, that would require only one division per histogram bin. Your example was 6000 divisions of the same variable. The latter would have much higher error, and it doesn’t sound like the latter even comes up in real code.

If you are careful to maximize the accuracy of your summation, and minimize the number of divisions, you can maximize your accuracy.