Performance of atomic operations

How are atomic operations on global memory implemented? How many serialized memory accesses will be generated per warp?

Is it possible to get any information related to atomics with the CUDA profiler or some other tool?

What about atomics on shared memory? How do they compare to global-memory atomics in terms of performance? Is it a good optimization approach to use atomics in shared memory, and then regular stores to global memory?

And finally, how does the performance of atomics compare to already-optimized primitives such as reductions?

Those are really good questions. I can’t answer most of them, except to say an atomic on shared memory is 16x slower than an ordinary smem access. This is obviously far faster than a gmem atomic. Smem atomics, like smem/cmem conflicts, cause “serializations” (the threads in a half warp can’t run in parallel), and these show up in the profiler as such.
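
As a concrete illustration of the "atomics in shared memory, then to global memory" pattern asked about above, here is a minimal sketch of a per-block histogram (the kernel name, NUM_BINS, and the parameter names are made up for this example; error checking is omitted). Contention from atomicAdd stays inside each block's shared memory, and only one update per bin per block ever touches global memory:

#define NUM_BINS 256

__global__ void blockHistogram(const unsigned char *data, int n,
                               unsigned int *globalHist)
{
    __shared__ unsigned int smemHist[NUM_BINS];

    // Zero the per-block histogram.
    for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x)
        smemHist[i] = 0;
    __syncthreads();

    // Accumulate in shared memory: atomic contention stays inside the block.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&smemHist[data[i]], 1u);
    __syncthreads();

    // Publish per-block totals. A global atomic is still needed here because
    // several blocks update the same bins, but there are far fewer of them.
    for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x)
        atomicAdd(&globalHist[i], smemHist[i]);
}

Note that shared-memory atomics require compute capability 1.2 or later, so this only applies to the newer end of the sm1.x range discussed in this thread.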

I used atomic adds on global memory; in my experience there was something like a 10% performance boost when the atomic instructions were replaced by ordinary global memory writes. Although this should be program-dependent…
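
For what it is worth, the substitution described above might look like the following sketch (the kernel names and the flag-counting task are invented for the example). The atomic version funnels every update through a single address, while the plain-store version gives each thread its own output word and defers the summation to a later reduction pass:

// Variant A: every thread bumps one shared global counter with an atomic.
__global__ void countAtomic(const int *flags, int n, int *counter)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && flags[i])
        atomicAdd(counter, 1);
}

// Variant B: every thread does an ordinary store into its own slot; the
// per-thread values are summed afterwards (e.g. by a reduction kernel).
__global__ void countPlain(const int *flags, int n, int *perThread)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        perThread[i] = flags[i] ? 1 : 0;
}

Which variant wins will of course depend on how contended the counter is and on whether the extra reduction pass is already needed for something else.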

As alex.dubinsky said, they cause lots of warp serializations when viewed in a profiler.

One can implement inter-block synchronization using atomics, which was lacking on sm1.0 architectures (i.e., you save on kernel launches, but at the expense of lengthier kernels and a possible increase in register count).
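
A sketch of one common form of that idiom, assuming a reasonably recent toolkit (the names are invented, and __threadfence(), which is needed to make each block's partial result visible before the counter is bumped, only exists in later CUDA versions): each block publishes a partial result, and whichever block increments a global counter last finishes the combine step inside the same kernel launch, saving a second launch:

__device__ unsigned int blocksDone = 0;

__global__ void reduceThenFinish(float *partial, float *result)
{
    // ... each block computes its partial result into partial[blockIdx.x] ...

    if (threadIdx.x == 0) {
        __threadfence();  // make this block's partial visible to other blocks
        // The last block to arrive combines all the partials itself.
        if (atomicAdd(&blocksDone, 1u) == gridDim.x - 1) {
            float total = 0.0f;
            for (unsigned int i = 0; i < gridDim.x; ++i)
                total += partial[i];
            *result = total;
            blocksDone = 0;  // reset so the kernel can be launched again
        }
    }
}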

Also, it looks like there is an undocumented PTX instruction ‘membar’ (which appeared with the sm1.1 architecture or later) that is probably related to serialization of memory accesses, but the lack of concrete syntax didn’t let me test it…
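
For reference, ‘membar’ did eventually get documented PTX syntax with scope suffixes (membar.cta and membar.gl, and later membar.sys), and newer CUDA toolkits expose the same fences through the __threadfence* intrinsics; it is a memory fence, i.e. it orders memory accesses rather than serializing them. A hedged sketch of both spellings (take the exact intrinsic-to-PTX mapping as an assumption, not a spec quote):

__device__ void fence_intrinsic(void)
{
    // Device-scope fence; a block-scope variant is __threadfence_block().
    __threadfence();
}

__device__ void fence_inline_ptx(void)
{
    // Roughly what __threadfence() compiles down to.
    asm volatile("membar.gl;" ::: "memory");
}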