CPU and CUDA code yield different results?

I am currently trying to port some CPU code over to CUDA for performance reasons, but I am getting slightly different results - on the order of 1E-12. Is this normal or is there something I am overlooking? I have tried to force both GPU and CPU to use double precision.

While this initial error is not a problem, it runs repeatedly (with each instance dependent on the previous instance) and the error builds to unacceptable levels. Any ideas on how to fix this?

For reference, this is the simplest code segment that yields a difference. The CPU version:

    for (int vert = 0; vert < vertexForces.GetLength(0); vert++)
    {
        vertices[vert, 0] += vertexForces[vert, 0] * (0.5 / 3.0) / maxForce[0];
        vertices[vert, 1] += vertexForces[vert, 1] * (0.5 / 3.0) / maxForce[0];
        vertices[vert, 2] += vertexForces[vert, 2] * (0.5 / 3.0) / maxForce[0];
    }

And the GPU version (kernel body):

    int vert = thread.blockIdx.x * thread.blockDim.x + thread.threadIdx.x;

    if (vert < vertexForces.GetLength(0))
    {
        vertices[vert, 0] += vertexForces[vert, 0] * (((double)0.5) / ((double)3.0)) / maxForce[0];
        vertices[vert, 1] += vertexForces[vert, 1] * (((double)0.5) / ((double)3.0)) / maxForce[0];
        vertices[vert, 2] += vertexForces[vert, 2] * (((double)0.5) / ((double)3.0)) / maxForce[0];
    }

The differences are likely due to the use of the FMA (fused multiply-add) instruction on the GPU, and it's likely that this makes the GPU results *more* accurate, although only a comparison with a higher-precision reference could tell one way or the other.

(1) Check out this whitepaper: https://developer.nvidia.com/sites/default/files/akamai/cuda/files/NVIDIA-CUDA-Floating-Point.pdf

(2) Try the nvcc compiler switch -fmad=false, which prevents the compiler from contracting multiplies and adds into FMAs.

If you are seeing accumulated differences between the CPU and GPU building to unacceptable levels, then it is quite possible that your original CPU code also has an accumulated-error problem. Long sums can destroy precision quite rapidly, regardless of the precision of your floating-point type.

Sure enough, that was it - apparently the “error” is more accurate.

Thank you both,