Atomic operation performance

I have experimented the performance of atomicadd function on C2050 machine.
We measured execution time on case where there is no memory conflict and possible coalescing access on global memory.
We fixed block size and increased threads per block.

According to the result of our experiment,
the performance of atomicadd was rapidly increased when the number of thread was the multiple of half warp.
The performance was increased more when it was the multiple of warp size.

I wonder this reason.

I attached the graph about our experiment.

block size: 140
X-axis: thread per block
Y-axis: time(ms)
red line: atomicadd, non memory conrflict, coalsecing access
blue line: += operation, non memory conrflict, coalsecing access
atomicVSnonatomic.pdf (169 KB)

Could you describe how to do use atomic operation, each thread access different memory location?
what do you mean “coalesce access” on these atomic operations ?

Can you add random conflict to test case?
(this will ensure that no specific code optimization is made based on coalsecing access & fully loaded warps)

__global__ test(int *g_data)


	int tid = threadIdx.x;

	int bid = blockIdx.x;

	int nthreads = blockDim.x;

	int id = tid + bid*nthreads;

	int r_value;

	int *data;

	r_value = atomicAdd(&g_data[id], 1);


int main(int argc, char** argv)



	cutilSafeCall(cudaMalloc((void**)&g_data, sizeof(int) * blocks * threads));

	cutilSafeCall(cudaMemset(g_data, 0, sizeof(int)  * blocks * threads));

	cutilSafeCall(cudaEventRecord(start2, 0));

	test<<<blocks, threads>>>(g_data);

	cutilSafeCall(cudaEventRecord(stop2, 0));




When block size is 140 and threads per block is increased from 1 to 1024,

the result is a red line in graph.

I don’t understand why random conflict test is needed.

Can you explain more detail?

This is too simple code. In case of full warp may be device or compiler (to device code) is smart enough to replace atomic_add to simple add instruction.

It will be much more interesting if small amount of inc conflicts appears

And probably performance degradation dependency of conflict % is much more interesting than performance itself.

I.e. programmer writes atomic_add in case if he can’t guarantee absents of conflict, but they can be very rare.

For the compiler to replace with simple add, the compiler should be able to know that all threads are adding the “same number” at the warp-level. Unless and until you are using a “constant” to do it, it will be very difficult for the compiler to diagnose.

For the device to do it, it needs to have comparators to do this. On the outset, it looks like a complex work that will come handy only in some corner cases anyway…