Atomic Operations

So I wrote two simple programs to test the performance of atomic operations, since I expected a performance hit because of the serialization that happens when atomic functions are used.

So I implemented the code below. The main function is the same for both versions; only the kernel changes. The problem I saw is that the version with atomic operations is about two times faster than the one without atomics.

With atomic operations:

#include <stdio.h>
__global__ void add(int *a){
	int id = blockDim.x * blockIdx.x + threadIdx.x;
	int idx = id % 10;

	/* every thread adds 10 to one of only 10 counters, so many threads update the same address */
	atomicAdd(&a[idx],10);
}

int main (int argc, char* argv[]){
	int n = 128000;
	size_t size = 10*sizeof(int);
	int *h_a = (int*)malloc(size);

	int i = 0;
	for (i = 0; i < 10; i++)
		h_a[i] = 0;

	int *d_a;
	cudaMalloc(&d_a, size);

	cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);

	add<<<n/128,128>>>(d_a);

	cudaMemcpy(h_a, d_a, size, cudaMemcpyDeviceToHost);

	for (i = 0; i < 10; i++)
		printf("%d ", h_a[i]);
	printf("\n");

	cudaFree(d_a);
	free(h_a);

	return 0;
}

Without atomic:

__global__ void add(int *a){
	int id = blockDim.x * blockIdx.x + threadIdx.x;
	int idx = id % 10;

	/* plain read-modify-write: threads race on the same addresses, so the final result is undefined */
	a[idx] = a[idx] + 10;
}

So what's the reason for the atomic operations being faster?

Questions about performance claims or analysis should generally include (in addition to the code):

  1. A description of the platform, e.g. OS, version, etc.
  2. CUDA version
  3. compile command line
  4. GPU used for test
  5. timing methodology, if it's not built into the code (a minimal event-timing sketch follows below)
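
For the last point, a minimal host-side sketch using CUDA events (reusing the add kernel, n, and d_a from the code above; this is just one possible approach, not necessarily how the timings below were taken):

	cudaEvent_t start, stop;
	float ms = 0.0f;
	cudaEventCreate(&start);
	cudaEventCreate(&stop);

	cudaEventRecord(start);
	add<<<n/128,128>>>(d_a);
	cudaEventRecord(stop);
	cudaEventSynchronize(stop);

	cudaEventElapsedTime(&ms, start, stop);   /* elapsed time in milliseconds */
	printf("kernel time: %f ms\n", ms);

	cudaEventDestroy(start);
	cudaEventDestroy(stop);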

When I run your code on a GTX 960 on Fedora 20, CUDA 7.5, and use nvprof --print-gpu-trace for timing, I get a reported time of 42us for the non-atomic kernel and 72us for the atomic version.

So according to my test, the atomic operations are not faster.

Compile command line:

nvcc -O3 -o t5 t5.cu -arch=sm_52
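
For reference, the corresponding nvprof invocation is along the lines of:

nvprof --print-gpu-trace ./t5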

OK. So I'm using a cluster node with the following config:

  • Intel® Core™ i7-4790 @ 3.6 GHz
  • Nvidia GTX 650 1 GB
  • 32 GB RAM DDR3
  • Gigabyte GA-Z97X-SLI
  • Linux Ubuntu Server 14.04.2 LTS 64-bit (kernel 3.13.0-57-generic)
  • CUDA SDK v6.5

Compiled with (also tested with -arch=sm_30, no difference):

nvcc atom.cu -o atom

I used nvprof, and the timing I got for the atomic version is

93.09%  26.304us         1  26.304us  26.304us  26.304us  add(int*)

and for the non-atomic version

96.37%  53.535us         1  53.535us  53.535us  53.535us  add(int*)

I also ran both codes on the GPUs I have.

The first run was on an NVS 315 on Ubuntu with CUDA 7.0. The timing I got from nvprof shows that the non-atomic version takes more time than the atomic one, which is clearly unexpected. In theory, atomic operations serialize access when multiple threads hit the same address simultaneously, which should result in slower performance.

This is the result from the NVS 315:
without atomic

99.37%  375.80us         1  375.80us  375.80us  375.80us  add(int*)

with atomic

98.70%  181.54us         1  181.54us  181.54us  181.54us  add(int*)

I ran the code multiple times and also tried nvprof with --print-gpu-trace, and I still got the same result as Dante003: atomic is faster than non-atomic.

The second run was on a GTS 450 on Ubuntu with CUDA 7.0. The result from this GPU is that atomic is slower than non-atomic, as in txbob's test.

without atomic

95.74%  61.165us         1  61.165us  61.165us  61.165us  add(int*)

with atomic

98.32%  157.29us         1  157.29us  157.29us  157.29us  add(int*)

Does anyone have any thoughts?

I don't have any answers, but one difference to consider is that an atomicAdd() that ignores the return value is compiled to a "RED.ADD" reduction operation and, if my understanding of Kepler is correct, the atomic add will be processed by hardware external to the SMX.

This doesn't explain your observation, but it might hint at an even more complex interaction between the SMX, L2, atomics, clock speed, and the memory interface on the GK107 GPU.
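
For anyone who wants to check that on their own GPU, here is a minimal sketch contrasting the two forms (the add_red/add_atom names and the old buffer are just illustrative, and whether RED or ATOM is actually emitted depends on the architecture and compiler):

	__global__ void add_red(int *a){
		int id = blockDim.x * blockIdx.x + threadIdx.x;
		int idx = id % 10;

		atomicAdd(&a[idx], 10);           /* return value ignored: may compile to RED.ADD */
	}

	__global__ void add_atom(int *a, int *old){
		int id = blockDim.x * blockIdx.x + threadIdx.x;
		int idx = id % 10;

		old[id] = atomicAdd(&a[idx], 10); /* return value used: needs a full atomic (ATOM.ADD) */
	}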

But dumping the SASS is always illuminating…
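
(For reference, one way to get the SASS from a compiled binary is cuobjdump, e.g., assuming the atom executable from the compile line above:)

cuobjdump -sass atom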

The non-atomic version of your routine requires three instructions and a roundtrip to global memory:

/*0058*/                   LD R0, [R2];                        /* 0x8000000000201c85 */
/*0060*/                   IADD R0, R0, 0xa;                   /* 0x4800c00028001c03 */
/*0068*/                   ST [R2], R0;                        /* 0x9000000000201c85 */

The atomic version is one instruction and, depending on the arch and target memory space, may be performed by hardware external to the SMX:

/*0060*/                   RED.ADD [R0], R2;                   /* 0x1000000000009c05 */

There was a good discussion on this subject at GTC 2013: Understanding and Using Atomic Memory Operations [PDF][FLV][Stream].

Another thought: I wonder if you would see the same timings if your GTX 650 were running at maximum clock speed and/or you ran the test more than once.
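
For example, a minimal sketch of that kind of test (the warm-up launch and the NUM_REPS count are illustrative assumptions, not part of the original code) might look like:

	add<<<n/128,128>>>(d_a);          /* warm-up launch so clocks can ramp up before timing */
	cudaDeviceSynchronize();

	int NUM_REPS = 100;
	int r = 0;
	for (r = 0; r < NUM_REPS; r++)    /* time many launches instead of a single one */
		add<<<n/128,128>>>(d_a);
	cudaDeviceSynchronize();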