So I wrote two simple programs to test the performance of atomic operations, because I expected a performance hit due to the serialization that happens when atomic functions are used.
I implemented the code below; the main function is the same in both versions, only the kernel changes. The problem is that the version with atomic operations runs about two times faster than the version without atomics.
With atomic operations:
#include <stdio.h>

__global__ void add(int *a)
{
    int id  = blockDim.x * blockIdx.x + threadIdx.x;
    int idx = id % 10;                  // all threads funnel into 10 counters
    atomicAdd(&a[idx], 10);
}

int main(int argc, char *argv[])
{
    int n = 128000;
    size_t size = 10 * sizeof(int);

    int *h_a = (int *)malloc(size);
    for (int i = 0; i < 10; i++)
        h_a[i] = 0;

    int *d_a;
    cudaMalloc(&d_a, size);
    cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);

    add<<<n / 128, 128>>>(d_a);

    cudaMemcpy(h_a, d_a, size, cudaMemcpyDeviceToHost);
    for (int i = 0; i < 10; i++)
        printf("%d ", h_a[i]);
    printf("\n");

    cudaFree(d_a);
    free(h_a);
    return 0;
}
Without atomic operations:
__global__ void add(int *a)
{
    int id  = blockDim.x * blockIdx.x + threadIdx.x;
    int idx = id % 10;
    a[idx] = a[idx] + 10;               // plain read-modify-write, no atomicity
}
So what is the reason for the atomic operations being faster?
Questions about performance claims or analysis should generally include (in addition to the code):
A description of the platform, e.g. OS, version, etc.
CUDA version
compile command line
GPU used for test
timing methodology (if it's not built into the code; a minimal CUDA-event timing sketch follows below)
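For reference, if the timing is done in host code rather than with a profiler, a minimal sketch using CUDA events around the launch from the posted code might look like this (the harness itself is illustrative, not taken from the original post):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
add<<<n / 128, 128>>>(d_a);        // kernel under test, as in the posted code
cudaEventRecord(stop);
cudaEventSynchronize(stop);        // block until the kernel and the stop event have completed

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("kernel time: %f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);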
When I run your code on a GTX 960 on Fedora 20, CUDA 7.5, and use nvprof --print-gpu-trace for timing, I get a reported time of 42us for the non-atomic kernel, and 72us for the atomic version.
So according to my test, the atomic operations are not faster.
The first run was on an NVS 315 on Ubuntu, CUDA 7.0. The timing I got from nvprof shows the non-atomic kernel taking more time than the atomic one, which is clearly not what I expected. In theory, atomic operations serialize access when multiple threads hit the same address simultaneously, which should result in slower performance.
I ran the code multiple times and also tried nvprof with --print-gpu-trace, and I still got the same result as Dante003: the atomic version is faster than the non-atomic one.
The second run was on a GTS 450 on Ubuntu, CUDA 7.0. On that GPU the atomic version is slower than the non-atomic one, matching txbob's test.
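To make the serialization argument above concrete, here is a hypothetical pair of kernels (the names and the nbins parameter are illustrative, not from the thread): the more threads that target the same address, the more the atomics have to serialize.

// every thread updates a single counter, so all atomics on it serialize
__global__ void add_max_contention(int *a)
{
    atomicAdd(&a[0], 10);
}

// threads are spread over nbins counters, so far fewer atomics collide
__global__ void add_low_contention(int *a, int nbins)
{
    int id = blockDim.x * blockIdx.x + threadIdx.x;
    atomicAdd(&a[id % nbins], 10);
}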
I don't have any answers, but one difference to consider is that an atomicAdd() that ignores the return value is compiled to a "RED.ADD" reduction operation. If my understanding of Kepler is correct, the atomic add will then be processed external to the SMX.
This doesn’t explain your observation but might hint at an even more complex interaction between the SMX, L2, atomics, clock speed and the memory interface on the GK107 GPU.
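As a rough sketch of that difference (the kernel names here are made up, and whether the first form really becomes a RED.ADD depends on the architecture and compiler), the distinction is simply whether the return value of atomicAdd() is consumed:

// return value ignored -> the compiler may emit a RED (reduction) instruction
__global__ void add_fire_and_forget(int *a)
{
    int idx = (blockDim.x * blockIdx.x + threadIdx.x) % 10;
    atomicAdd(&a[idx], 10);
}

// return value used -> the compiler emits an ATOM instruction that returns the old value
__global__ void add_read_old(int *a, int *old)
{
    int id  = blockDim.x * blockIdx.x + threadIdx.x;
    old[id] = atomicAdd(&a[id % 10], 10);
}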
But dumping the SASS is always illuminating…
The non-atomic version of your routine requires three instructions and a roundtrip to global memory: