How much faster are atomicAdd() operations to __shared__ on SM >= 5X?

I worry I’m on the cusp of a defeat… I feel like I’ve been here before. My work is coming down to the idea that atomicAdd() to shared is faster than atomicAdd() to global.

Can anyone tell me approximately how much faster atomicAdd() on integers (plain old 32-bit ints, which I am confident I can use) is to shared memory than to global memory, on Maxwell and beyond? Is it significantly faster, marginally faster, or merely no longer a disaster (that is, no longer slower than atomicAdd() to global) on SM 5X and newer architectures?

Thanks!

It depends on various factors. See figure 3 here:

https://devblogs.nvidia.com/parallelforall/gpu-pro-tip-fast-histograms-using-shared-atomics-maxwell/

Thanks for showing this to me. I had seen the link in searches but always got diverted into one of the other results, not realizing how much good information this post had. It looks like shared atomics are definitely advantageous, and, as always, spreading the work across the 32 banks appears to help as well (I can write my work units to do as much of that as possible). One other question along these lines: are unsigned long long int atomics to shared memory also supported on SM 5X? The reading I’ve done suggests that only 32-bit integer atomics are provided on Maxwell, and that 64-bit atomic operations are not supported until Pascal.
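For anyone following along, here is a minimal sketch of the pattern the linked post is about: each block accumulates into its own __shared__ histogram with atomicAdd(), then merges into global memory once at the end. The names (histogram_shared, run_histogram_check, NUM_BINS), the 256-bin layout, and the launch configuration are my own assumptions for illustration, not code from the blog post, and nothing here says how large the speedup is on a given GPU.

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

constexpr int NUM_BINS = 256;  // assumed: one bin per possible byte value

// Each block builds a private histogram in shared memory, then merges it
// into the global histogram with at most one global atomicAdd() per bin.
__global__ void histogram_shared(const unsigned char *data, size_t n,
                                 unsigned int *global_hist)
{
    __shared__ unsigned int local_hist[NUM_BINS];

    // Zero the block-private histogram.
    for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x)
        local_hist[i] = 0;
    __syncthreads();

    // Grid-stride loop: 32-bit atomicAdd() to shared memory.
    size_t stride = (size_t)blockDim.x * gridDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        atomicAdd(&local_hist[data[i]], 1u);
    __syncthreads();

    // Merge non-empty bins into the global histogram.
    for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x)
        if (local_hist[i] != 0)
            atomicAdd(&global_hist[i], local_hist[i]);
}

// Hypothetical host-side check: compares the GPU result with a CPU count.
bool run_histogram_check()
{
    const size_t n = 1 << 20;
    unsigned char *h_data = (unsigned char *)malloc(n);
    if (!h_data) return false;

    unsigned int expected[NUM_BINS] = {0};
    for (size_t i = 0; i < n; ++i) {
        h_data[i] = (unsigned char)((i * 31u + 7u) & 0xFF);
        ++expected[h_data[i]];
    }

    unsigned char *d_data = nullptr;
    unsigned int *d_hist = nullptr;
    unsigned int h_hist[NUM_BINS];
    bool ok = cudaMalloc(&d_data, n) == cudaSuccess &&
              cudaMalloc(&d_hist, sizeof(h_hist)) == cudaSuccess;
    if (ok) {
        cudaMemcpy(d_data, h_data, n, cudaMemcpyHostToDevice);
        cudaMemset(d_hist, 0, sizeof(h_hist));
        histogram_shared<<<64, 256>>>(d_data, n, d_hist);
        // The blocking copy also surfaces any kernel launch/execution error.
        ok = cudaMemcpy(h_hist, d_hist, sizeof(h_hist),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
        for (int b = 0; ok && b < NUM_BINS; ++b)
            ok = (h_hist[b] == expected[b]);
    }
    cudaFree(d_data);
    cudaFree(d_hist);
    free(h_data);
    return ok;
}
```

A main() that just calls run_histogram_check() is enough to try it with nvcc. To compare against global atomics, a variant that does atomicAdd(&global_hist[data[i]], 1u) directly in the loop could be timed on the same input; the result will depend on how the input values are distributed.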

The programming guide indicates what is supported, on which architectures:

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#atomic-functions

For example, atomicAdd:

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#atomicadd

The only type restriction listed for atomicAdd() is that the 64-bit floating-point (double) version is only supported on compute capability 6.0 and higher.
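As a rough illustration of the 64-bit integer case asked about, the sketch below uses the unsigned long long int overload of atomicAdd() on a __shared__ accumulator and then on a global total. The kernel and helper names (sum_u64, run_sum_u64_check) and the launch configuration are assumptions made for this example; it shows only that the overload can be used this way, not how it performs relative to 32-bit atomics.

```cuda
#include <cuda_runtime.h>

// Sums 32-bit inputs into a 64-bit total. Each block reduces into a 64-bit
// shared accumulator, then adds that partial sum to the global total.
__global__ void sum_u64(const unsigned int *data, size_t n,
                        unsigned long long *total)
{
    __shared__ unsigned long long block_sum;
    if (threadIdx.x == 0)
        block_sum = 0ULL;
    __syncthreads();

    // Per-thread partial sum over a grid-stride loop.
    unsigned long long thread_sum = 0ULL;
    size_t stride = (size_t)blockDim.x * gridDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        thread_sum += data[i];

    // 64-bit integer atomicAdd() to shared memory.
    atomicAdd(&block_sum, thread_sum);
    __syncthreads();

    // One 64-bit atomicAdd() to global memory per block.
    if (threadIdx.x == 0)
        atomicAdd(total, block_sum);
}

// Hypothetical host-side check: every input is 0xFFFFFFFF, so the expected
// total is n * 0xFFFFFFFF, which does not fit in 32 bits.
bool run_sum_u64_check()
{
    const size_t n = 1 << 20;
    const unsigned long long expected = (unsigned long long)n * 0xFFFFFFFFULL;

    unsigned int *d_data = nullptr;
    unsigned long long *d_total = nullptr;
    unsigned long long h_total = 0ULL;
    bool ok = cudaMalloc(&d_data, n * sizeof(unsigned int)) == cudaSuccess &&
              cudaMalloc(&d_total, sizeof(unsigned long long)) == cudaSuccess;
    if (ok) {
        cudaMemset(d_data, 0xFF, n * sizeof(unsigned int));  // all bytes 0xFF
        cudaMemset(d_total, 0, sizeof(unsigned long long));
        sum_u64<<<64, 256>>>(d_data, n, d_total);
        // The blocking copy also surfaces any kernel launch/execution error.
        ok = cudaMemcpy(&h_total, d_total, sizeof(h_total),
                        cudaMemcpyDeviceToHost) == cudaSuccess &&
             h_total == expected;
    }
    cudaFree(d_data);
    cudaFree(d_total);
    return ok;
}
```

Accumulating per thread first keeps the number of 64-bit shared atomics to one per thread, which seems a sensible default when the relative cost of 64-bit atomics on a given architecture is unknown.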