Performance of atomic operations

How are atomic operations on global memory implemented? How many serialized memory accesses will be generated per warp?

Is it possible to get any information related to atomics with the CUDA profiler or some other tool?

What about atomics on shared memory? How do they compare to global-memory atomics in terms of performance? Is it a good optimization approach to use atomics in shared memory, and then regular stores to global memory?

And finally, how does the performance of atomics compare to already-optimized primitives such as reductions?

Those are really good questions. I can’t answer most of them, except to say an atomic on shared memory is 16x slower than an ordinary smem access. This is obviously far faster than a gmem atomic. Smem atomics, like smem/cmem conflicts, cause “serializations” (the threads in a half warp can’t run in parallel), and these show up in the profiler as such.
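
As a concrete illustration of the "atomics in shared memory, then to global memory" pattern asked about above, here is a minimal sketch of a per-block histogram (the kernel name, NUM_BINS, and the parameter names are made up for this example; error checking is omitted). Contention from atomicAdd stays inside each block's shared memory, and only one update per bin per block ever touches global memory:

#define NUM_BINS 256

__global__ void blockHistogram(const unsigned char *data, int n,
                               unsigned int *globalHist)
{
    __shared__ unsigned int smemHist[NUM_BINS];

    // Zero the per-block histogram.
    for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x)
        smemHist[i] = 0;
    __syncthreads();

    // Accumulate in shared memory: atomic contention stays inside the block.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&smemHist[data[i]], 1u);
    __syncthreads();

    // Publish per-block totals. A global atomic is still needed here because
    // several blocks update the same bins, but there are far fewer of them.
    for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x)
        atomicAdd(&globalHist[i], smemHist[i]);
}

Note that shared-memory atomics require compute capability 1.2 or later, so this only applies to the newer end of the sm1.x range discussed in this thread.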

I used atomic adds on global memory; in my experience there was something like a 10% performance boost when the atomic instructions were replaced by ordinary global memory writes. Although this should be program-dependent…
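
For what it is worth, the substitution described above might look like the following sketch (the kernel names and the flag-counting task are invented for the example). The atomic version funnels every update through a single address, while the plain-store version gives each thread its own output word and defers the summation to a later reduction pass:

// Variant A: every thread bumps one shared global counter with an atomic.
__global__ void countAtomic(const int *flags, int n, int *counter)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && flags[i])
        atomicAdd(counter, 1);
}

// Variant B: every thread does an ordinary store into its own slot; the
// per-thread values are summed afterwards (e.g. by a reduction kernel).
__global__ void countPlain(const int *flags, int n, int *perThread)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        perThread[i] = flags[i] ? 1 : 0;
}

Which variant wins will of course depend on how contended the counter is and on whether the extra reduction pass is already needed for something else.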

As alex.dubinsky said, they cause lots of warp serializations when viewed in a profiler.

One can implement inter-block synchronization using atomics, which was lacking on sm1.0 architectures (i.e., you save on kernel launches, but at the expense of lengthier kernels and a possible increase in register count).
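
A sketch of one common form of that idiom, assuming a reasonably recent toolkit (the names are invented, and __threadfence(), which is needed to make each block's partial result visible before the counter is bumped, only exists in later CUDA versions): each block publishes a partial result, and whichever block increments a global counter last finishes the combine step inside the same kernel launch, saving a second launch:

__device__ unsigned int blocksDone = 0;

__global__ void reduceThenFinish(float *partial, float *result)
{
    // ... each block computes its partial result into partial[blockIdx.x] ...

    if (threadIdx.x == 0) {
        __threadfence();  // make this block's partial visible to other blocks
        // The last block to arrive combines all the partials itself.
        if (atomicAdd(&blocksDone, 1u) == gridDim.x - 1) {
            float total = 0.0f;
            for (unsigned int i = 0; i < gridDim.x; ++i)
                total += partial[i];
            *result = total;
            blocksDone = 0;  // reset so the kernel can be launched again
        }
    }
}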

Also, it looks like there is an undocumented PTX instruction ‘membar’ (which appeared with the sm1.1 architecture or later) that is probably related to serialization of memory accesses, but the lack of concrete syntax didn’t let me test it…
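
For reference, ‘membar’ did eventually get documented PTX syntax with scope suffixes (membar.cta and membar.gl, and later membar.sys), and newer CUDA toolkits expose the same fences through the __threadfence* intrinsics; it is a memory fence, i.e. it orders memory accesses rather than serializing them. A hedged sketch of both spellings (take the exact intrinsic-to-PTX mapping as an assumption, not a spec quote):

__device__ void fence_intrinsic(void)
{
    // Device-scope fence; a block-scope variant is __threadfence_block().
    __threadfence();
}

__device__ void fence_inline_ptx(void)
{
    // Roughly what __threadfence() compiles down to.
    asm volatile("membar.gl;" ::: "memory");
}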