I am porting some CPU code to CUDA for performance reasons, but the GPU results differ slightly from the CPU results, on the order of 1E-12. Is this normal, or am I overlooking something? I have tried to force both the GPU and the CPU to use double precision.
While this initial error is not a problem on its own, the code runs repeatedly (each iteration depends on the result of the previous one), and the error builds to unacceptable levels. Any ideas on how to fix this?
For reference, this is the simplest code segment that yields a difference:
CPU:
for (int vert = 0; vert < vertexForces.GetLength(0); vert++)
{
    vertices[vert, 0] += vertexForces[vert, 0] * (0.5 / 3.0) / maxForce[0];
    vertices[vert, 1] += vertexForces[vert, 1] * (0.5 / 3.0) / maxForce[0];
    vertices[vert, 2] += vertexForces[vert, 2] * (0.5 / 3.0) / maxForce[0];
}
GPU:
int vert = thread.blockIdx.x * thread.blockDim.x + thread.threadIdx.x;
if (vert < vertexForces.GetLength(0))
{
    vertices[vert, 0] += vertexForces[vert, 0] * (((double)0.5) / ((double)3.0)) / maxForce[0];
    vertices[vert, 1] += vertexForces[vert, 1] * (((double)0.5) / ((double)3.0)) / maxForce[0];
    vertices[vert, 2] += vertexForces[vert, 2] * (((double)0.5) / ((double)3.0)) / maxForce[0];
}
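For anyone wanting to reproduce the comparison, the mismatch can be quantified along these lines. This is only a rough sketch, assuming both results have been copied back to the host as double[,] arrays of the same shape; ResultCheck, MaxAbsDiff and UlpDistance are hypothetical helper names and are not part of the code above.

```csharp
using System;

// Hypothetical helpers for comparing a CPU result with a GPU result.
static class ResultCheck
{
    // Largest absolute element-wise difference between two arrays of the same shape.
    public static double MaxAbsDiff(double[,] a, double[,] b)
    {
        if (a.GetLength(0) != b.GetLength(0) || a.GetLength(1) != b.GetLength(1))
            throw new ArgumentException("Arrays must have the same shape.");

        double max = 0.0;
        for (int i = 0; i < a.GetLength(0); i++)
            for (int j = 0; j < a.GetLength(1); j++)
                max = Math.Max(max, Math.Abs(a[i, j] - b[i, j]));
        return max;
    }

    // Number of representable doubles (ULPs) between two finite values.
    // Assumption: x and y have the same sign; mixed signs are not handled here.
    public static long UlpDistance(double x, double y)
    {
        long bx = BitConverter.DoubleToInt64Bits(x);
        long by = BitConverter.DoubleToInt64Bits(y);
        return Math.Abs(bx - by);
    }
}
```

Calling ResultCheck.MaxAbsDiff(cpuVertices, gpuVertices) after a single step (both array names are placeholders) gives the absolute mismatch, and UlpDistance on one pair of elements shows whether that mismatch is a last-bit rounding difference or something larger.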