Atomic Operations

So I wrote two simple programs to test the performance of atomic operations, since I expected a performance hit because of the serialization that happens when atomic functions are used.

So I implemented the code below. The main function is the same for both versions; only the kernel changes. The problem I saw is that the version with atomic operations is about two times faster than the one without atomics.

With atomic operations:

#include <stdio.h>
__global__ void add(int *a){
	int id = blockDim.x * blockIdx.x + threadIdx.x;
	int idx = id % 10;

	/* every thread adds 10 to one of only 10 counters, so many threads update the same address */
	atomicAdd(&a[idx],10);
}

int main (int argc, char* argv[]){
	int n = 128000;
	size_t size = 10*sizeof(int);
	int *h_a = (int*)malloc(size);

	int i = 0;
	for (i = 0; i < 10; i++)
		h_a[i] = 0;

	int *d_a;
	cudaMalloc(&d_a, size);

	cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);

	add<<<n/128,128>>>(d_a);

	cudaMemcpy(h_a, d_a, size, cudaMemcpyDeviceToHost);

	for (i = 0; i < 10; i++)
		printf("%d ", h_a[i]);
	printf("\n");

	cudaFree(d_a);
	free(h_a);

	return 0;
}

Without atomic:

__global__ void add(int *a){
	int id = blockDim.x * blockIdx.x + threadIdx.x;
	int idx = id % 10;

	/* plain read-modify-write: threads race on the same addresses, so the final result is undefined */
	a[idx] = a[idx] + 10;
}

So what's the reason for the atomic operations being faster?

Questions about performance claims or analysis should generally include (in addition to the code):

  1. A description of the platform, e.g. OS, version, etc.
  2. CUDA version
  3. compile command line
  4. GPU used for test
  5. timing methodology, if it's not built into the code (a minimal event-timing sketch follows below)
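
For the last point, a minimal host-side sketch using CUDA events (reusing the add kernel, n, and d_a from the code above; this is just one possible approach, not necessarily how the timings below were taken):

	cudaEvent_t start, stop;
	float ms = 0.0f;
	cudaEventCreate(&start);
	cudaEventCreate(&stop);

	cudaEventRecord(start);
	add<<<n/128,128>>>(d_a);
	cudaEventRecord(stop);
	cudaEventSynchronize(stop);

	cudaEventElapsedTime(&ms, start, stop);   /* elapsed time in milliseconds */
	printf("kernel time: %f ms\n", ms);

	cudaEventDestroy(start);
	cudaEventDestroy(stop);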

When I run your code on a GTX 960 on Fedora 20, CUDA 7.5, and use nvprof --print-gpu-trace for timing, I get a reported time of 42us for the non-atomic kernel and 72us for the atomic version.

So according to my test, the atomic operations are not faster.

Compile command line:

nvcc -O3 -o t5 t5.cu -arch=sm_52
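
For reference, the corresponding nvprof invocation is along the lines of:

nvprof --print-gpu-trace ./t5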

OK. So I'm using a cluster node with the following config:

  • Intel® Core™ i7-4790 @ 3.6 GHz
  • Nvidia GTX 650 1 GB
  • 32 GB RAM DDR3
  • Gigabyte GA-Z97X-SLI
  • Linux Ubuntu Server 14.04.2 LTS 64-bit (kernel 3.13.0-57-generic)
  • CUDA SDK v6.5

Compiled with (also tested with -arch=sm_30, no difference):

nvcc atom.cu -o atom

I used nvprof, and the timing I got for the atomic version is

93.09%  26.304us         1  26.304us  26.304us  26.304us  add(int*)

and for the non-atomic version

96.37%  53.535us         1  53.535us  53.535us  53.535us  add(int*)

I also ran both codes on the GPUs I have.

The first run was on an NVS 315 on Ubuntu with CUDA 7.0. The timing I got from nvprof shows that the non-atomic version takes more time than the atomic one, which is clearly unexpected. In theory, atomic operations serialize access when multiple threads hit the same address simultaneously, which should result in slower performance.

This is the result from the NVS 315:
without atomic

99.37%  375.80us         1  375.80us  375.80us  375.80us  add(int*)

with atomic

98.70%  181.54us         1  181.54us  181.54us  181.54us  add(int*)

I ran the code multiple times and also tried nvprof with --print-gpu-trace, and I still got the same result as Dante003: atomic is faster than non-atomic.

The second run was on a GTS 450 on Ubuntu with CUDA 7.0. The result from this GPU is that atomic is slower than non-atomic, as in txbob's test.

without atomic

95.74%  61.165us         1  61.165us  61.165us  61.165us  add(int*)

with atomic

98.32%  157.29us         1  157.29us  157.29us  157.29us  add(int*)

Does anyone have any thoughts?

I don't have any answers, but one difference to consider is that an atomicAdd() that ignores the return value is compiled to a "RED.ADD" reduction operation and, if my understanding of Kepler is correct, the atomic add will be processed by hardware external to the SMX.

This doesn't explain your observation, but it might hint at an even more complex interaction between the SMX, L2, atomics, clock speed, and the memory interface on the GK107 GPU.
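
For anyone who wants to check that on their own GPU, here is a minimal sketch contrasting the two forms (the add_red/add_atom names and the old buffer are just illustrative, and whether RED or ATOM is actually emitted depends on the architecture and compiler):

	__global__ void add_red(int *a){
		int id = blockDim.x * blockIdx.x + threadIdx.x;
		int idx = id % 10;

		atomicAdd(&a[idx], 10);           /* return value ignored: may compile to RED.ADD */
	}

	__global__ void add_atom(int *a, int *old){
		int id = blockDim.x * blockIdx.x + threadIdx.x;
		int idx = id % 10;

		old[id] = atomicAdd(&a[idx], 10); /* return value used: needs a full atomic (ATOM.ADD) */
	}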

But dumping the SASS is always illuminating…
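
(For reference, one way to get the SASS from a compiled binary is cuobjdump, e.g., assuming the atom executable from the compile line above:)

cuobjdump -sass atom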

The non-atomic version of your routine requires three instructions and a roundtrip to global memory:

/*0058*/                   LD R0, [R2];                        /* 0x8000000000201c85 */
/*0060*/                   IADD R0, R0, 0xa;                   /* 0x4800c00028001c03 */
/*0068*/                   ST [R2], R0;                        /* 0x9000000000201c85 */

The atomic version is one instruction and, depending on the arch and target memory space, may be performed by hardware external to the SMX:

/*0060*/                   RED.ADD [R0], R2;                   /* 0x1000000000009c05 */

There was a good discussion on this subject at GTC 2013: Understanding and Using Atomic Memory Operations [PDF][FLV][Stream].

Another thought: I wonder if you would see the same timings if your GTX 650 were running at maximum clock speed and/or you ran the test more than once.
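
For example, a minimal sketch of that kind of test (the warm-up launch and the NUM_REPS count are illustrative assumptions, not part of the original code) might look like:

	add<<<n/128,128>>>(d_a);          /* warm-up launch so clocks can ramp up before timing */
	cudaDeviceSynchronize();

	int NUM_REPS = 100;
	int r = 0;
	for (r = 0; r < NUM_REPS; r++)    /* time many launches instead of a single one */
		add<<<n/128,128>>>(d_a);
	cudaDeviceSynchronize();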