Possible problem with atomic on global memory

I’m having a problem with some CUDA code that looks like this:

	if(tid < 16)
		atomicAdd(&pixeldata[(blocky + (tid >> 2)) * pixelpitch + blockx + (tid & 3)], splat);

pixeldata is an array of floats in global memory, and splat is a float held in a register.

blockx and blocky are the 2D coordinates of the current thread block.

The 16 threads with tid < 16 write to a patch of 4 x 4 pixels in pixeldata. The atomicAdd is required because adjacent blocks of threads are also trying to write to overlapping patches of 4 x 4 pixels.

My problem is that when I compare images from different kernel executions I get a small percentage of pixels (distributed apparently randomly and in a non-reproducible way) with differences. I realize that the order of addition could be changing between executions, so floating point rounding errors might be a factor, but the differences are in some cases quite large, so I think that something more serious is happening.
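Just to illustrate the rounding point: the order of float additions alone can change a result. A minimal host-side example, with values chosen purely to make the effect visible:

	#include <cstdio>

	int main()
	{
		float a = 1.0e8f, b = 1.0f, c = -1.0e8f;
		printf("%f\n", (a + b) + c);	// prints 0.000000: b is rounded away when added to a first
		printf("%f\n", (a + c) + b);	// prints 1.000000
		return 0;
	}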

Do I need to use __syncthreads() or __threadfence() or volatile when using atomicAdd()?

You shouldn’t need barriers unless your issue is related to ensuring the patches are initialized before you start atomically updating them. Perhaps some threads are writing to the patches but not via atomicAdd()?
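If the destination just needs to start at zero, a minimal host-side sketch would be something like this (width and height are placeholders for the real image dimensions; with a pitched allocation you would size the memset by the pitch instead):

	// Clear the output before the launch so every later value comes only from
	// the atomicAdd()s; memset to 0 works here because 0.0f is an all-zero bit pattern.
	cudaMemset(pixeldata, 0, width * height * sizeof(float));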

I spent the last week debugging code that scatters ints with atomicAdd() calls. I fixed a few subtle bugs, but the atomic ops weren’t the issue… :)

No, the only thing that writes to pixeldata is the atomicAdd(). Initialization is done using a memcpy before launching the kernel.

I haven’t 100% verified that the problem isn’t earlier in the kernel but that’s basically just a reduction in shared memory (pretty much following the SDK example). For debugging purposes I guess I can split it into 16 passes and only output non-overlapping values on each pass.

You could try setting ‘splat’ to 0.0f (and then 1.0f) for now and see if the sums match your expectations. It sounds subtle…
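A rough sketch of the 1.0f version, keeping your indexing and just substituting a constant (assumes pixeldata starts out zeroed for the test):

	// Debug-only variant: every contribution adds exactly 1.0f, so each pixel
	// should end up holding the number of 4 x 4 patches that touched it,
	// and the result should be bit-identical across runs.
	if(tid < 16)
		atomicAdd(&pixeldata[(blocky + (tid >> 2)) * pixelpitch + blockx + (tid & 3)], 1.0f);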

Thanks for that suggestion. I haven’t tried that yet.

I did try my idea of splitting the output into 16 passes though. I still get variations between executions, which indicates that there is definitely a problem before the global atomic.

The problem seems to be in the final warp-synchronous part of the reduction. If I add the extra __syncthreads() needed to make this “safe” (and if I still use 16 passes to ensure I don’t get any reordering of additions), then my variations disappear. I know that this kind of warp-synchronous programming is now deprecated, but my understanding was that it should still produce correct results on current hardware. Is that not the case? This reduction code is still in the SDK examples, after all.

More specifically, it’s only the first omitted __syncthreads() that actually needs to be put back in.

	if(tid < 32)
		vintensitydata[tid] += vintensitydata[tid + 32];
	__syncthreads(); // THIS IS ACTUALLY NEEDED (ALTHOUGH ABSENT IN THE SDK EXAMPLE)
	if(tid < 16)
	{
		vintensitydata[tid] += vintensitydata[tid + 16];
		vintensitydata[tid] += vintensitydata[tid + 8];
		vintensitydata[tid] += vintensitydata[tid + 4];
		vintensitydata[tid] += vintensitydata[tid + 2];
		if(tid == i)
			splat = vintensitydata[0] + vintensitydata[1];
	}
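
For what it’s worth, if I do end up rewriting this, one way to avoid relying on implicit warp synchronization altogether would be to finish the last warp with shuffle intrinsics. This is only a sketch, assuming CUDA 9 or later (for the *_sync shuffle intrinsics) and that vintensitydata holds 64 partial sums at this point:

	if(tid < 32)
	{
		// Load two partials per lane, then finish the sum entirely in registers.
		float v = vintensitydata[tid] + vintensitydata[tid + 32];
		for(int offset = 16; offset > 0; offset >>= 1)
			v += __shfl_down_sync(0xffffffff, v, offset);
		v = __shfl_sync(0xffffffff, v, 0);	// broadcast the total to the whole warp
		if(tid == i)
			splat = v;
	}

Since the intermediate values never go back through shared memory, the warp-synchronous timing assumptions (and the volatile qualifier, for these steps) drop out entirely.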

Try declaring the shared memory volatile (i.e. __shared__ volatile float vintensitydata[THREADS]).

and take out that last __syncthreads().

Keep in mind I cannot see the rest of the code, but often you use the keyword ‘volatile’ in order to prevent some weird compiler reordering.

I already have vintensitydata declared as

volatile float *vintensitydata = intensitydata;

(as per the reduction example in the SDK).

I haven’t benchmarked the code yet, so I don’t know if this part is even performance critical. If it is, I’ll probably rewrite it. I think I can use the last warp more efficiently anyway. At the moment I’m doing 16 parallel reductions sequentially, but I could stop each one as soon as I get down to the last warp and then do 16 sequential reductions in parallel instead.