Possible problem with atomic on global memory

I’m having a problem with some CUDA code that looks like this:

	if(tid < 16)
		atomicAdd(&pixeldata[(blocky + (tid >> 2)) * pixelpitch + blockx + (tid & 3)], splat);

pixeldata is an array of floats in global memory, and splat is a float held in a register.

blockx and blocky are the 2D coordinates of the current thread block.

The 16 threads with tid < 16 write to a patch of 4 x 4 pixels in pixeldata. The atomicAdd is required because adjacent blocks of threads are also trying to write to overlapping patches of 4 x 4 pixels.

My problem is that when I compare images from different kernel executions I get a small percentage of pixels (distributed apparently randomly and in a non-reproducible way) with differences. I realize that the order of addition could be changing between executions, so floating point rounding errors might be a factor, but the differences are in some cases quite large, so I think that something more serious is happening.
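Just to illustrate the rounding point: the order of float additions alone can change a result. A minimal host-side example, with values chosen purely to make the effect visible:

	#include <cstdio>

	int main()
	{
		float a = 1.0e8f, b = 1.0f, c = -1.0e8f;
		printf("%f\n", (a + b) + c);	// prints 0.000000: b is rounded away when added to a first
		printf("%f\n", (a + c) + b);	// prints 1.000000
		return 0;
	}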

Do I need to use __syncthreads() or __threadfence() or volatile when using atomicAdd()?

You shouldn’t need barriers unless your issue is related to ensuring the patches are initialized before you start atomically updating them. Perhaps some threads are writing to the patches but not via atomicAdd()?
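If the destination just needs to start at zero, a minimal host-side sketch would be something like this (width and height are placeholders for the real image dimensions; with a pitched allocation you would size the memset by the pitch instead):

	// Clear the output before the launch so every later value comes only from
	// the atomicAdd()s; memset to 0 works here because 0.0f is an all-zero bit pattern.
	cudaMemset(pixeldata, 0, width * height * sizeof(float));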

I spent the last week debugging code that scatters ints with atomicAdd() calls. I fixed a few subtle bugs, but the atomic ops weren’t the issue… :)

No, the only thing that writes to pixeldata is the atomicAdd(). Initialization is done using a memcpy before launching the kernel.

I haven’t 100% verified that the problem isn’t earlier in the kernel but that’s basically just a reduction in shared memory (pretty much following the SDK example). For debugging purposes I guess I can split it into 16 passes and only output non-overlapping values on each pass.

You could try setting ‘splat’ to 0.0f (and then 1.0f) for now and see if the sums match your expectations. It sounds subtle…
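A rough sketch of the 1.0f version, keeping your indexing and just substituting a constant (assumes pixeldata starts out zeroed for the test):

	// Debug-only variant: every contribution adds exactly 1.0f, so each pixel
	// should end up holding the number of 4 x 4 patches that touched it,
	// and the result should be bit-identical across runs.
	if(tid < 16)
		atomicAdd(&pixeldata[(blocky + (tid >> 2)) * pixelpitch + blockx + (tid & 3)], 1.0f);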

Thanks for that suggestion. I haven’t tried that yet.

I did try my idea of splitting the output into 16 passes though. I still get variations between executions, which indicates that there is definitely a problem before the global atomic.

The problem seems to be in the final warp-synchronous part of the reduction. If I add the extra __syncthreads() needed to make this “safe” (and if I still use 16 passes to ensure I don’t get any reordering of additions), then my variations disappear. I know that this kind of warp-synchronous programming is now deprecated, but my understanding was that it should still produce correct results on current hardware. Is that not the case? This reduction code is still in the SDK examples, after all.

More specifically, it’s only the first omitted __syncthreads() that actually needs to be put back in.

	if(tid < 32)
		vintensitydata[tid] += vintensitydata[tid + 32];
	__syncthreads(); // THIS IS ACTUALLY NEEDED (ALTHOUGH ABSENT IN THE SDK EXAMPLE)
	if(tid < 16)
	{
		vintensitydata[tid] += vintensitydata[tid + 16];
		vintensitydata[tid] += vintensitydata[tid + 8];
		vintensitydata[tid] += vintensitydata[tid + 4];
		vintensitydata[tid] += vintensitydata[tid + 2];
		if(tid == i)
			splat = vintensitydata[0] + vintensitydata[1];
	}
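
For what it’s worth, if I do end up rewriting this, one way to avoid relying on implicit warp synchronization altogether would be to finish the last warp with shuffle intrinsics. This is only a sketch, assuming CUDA 9 or later (for the *_sync shuffle intrinsics) and that vintensitydata holds 64 partial sums at this point:

	if(tid < 32)
	{
		// Load two partials per lane, then finish the sum entirely in registers.
		float v = vintensitydata[tid] + vintensitydata[tid + 32];
		for(int offset = 16; offset > 0; offset >>= 1)
			v += __shfl_down_sync(0xffffffff, v, offset);
		v = __shfl_sync(0xffffffff, v, 0);	// broadcast the total to the whole warp
		if(tid == i)
			splat = v;
	}

Since the intermediate values never go back through shared memory, the warp-synchronous timing assumptions (and the volatile qualifier, for these steps) drop out entirely.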

Try declaring the shared memory volatile (i.e. __shared__ volatile float vintensitydata[THREADS]).

and take out that last __syncthreads().

Keep in mind I cannot see the rest of the code, but often you use the keyword ‘volatile’ in order to prevent some weird compiler reordering.

I already have vintensitydata declared as

volatile float *vintensitydata = intensitydata;

(as per the reduction example in the SDK).

I haven’t benchmarked the code yet, so I don’t know if this part is even performance critical. If it is, I’ll probably rewrite it. I think I can use the last warp more efficiently anyway. At the moment I’m doing 16 parallel reductions sequentially, but I could stop each one as soon as I get down to the last warp and then do 16 sequential reductions in parallel instead.