I’m having a problem with some CUDA code that looks like this:
if (tid < 16)
    atomicAdd(&pixeldata[(blocky + (tid >> 2)) * pixelpitch + blockx + (tid & 3)], splat);
pixeldata is an array of floats in global memory, and splat is a float held in a register.
blockx and blocky are the 2D coordinates of the current thread block.
The 16 threads with tid < 16 each write one pixel of a 4 x 4 patch in pixeldata. The atomicAdd is required because adjacent thread blocks also write to overlapping 4 x 4 patches.
My problem is that when I compare images from different executions of the kernel, a small percentage of pixels differ, and the differing pixels are distributed apparently randomly and are not reproducible between runs. I realize the order of the additions can change between executions, so floating-point rounding could account for some of this, but in some cases the differences are quite large, so I think something more serious is happening.
Do I need to use __syncthreads(), __threadfence(), or volatile when using atomicAdd()?