atomicAdd to shared memory vs. to global memory: which is faster? (for Turing or later)

202476410arsmart · May 27, 2022, 9:23am

Hi! I have gone through the different ways of using atomicAdd, and my algorithm is now stuck on this, so I am wondering what the current situation is for the more recent architectures, such as Volta, Turing, or Ampere?

The comparisons are:

First atomicAdd to shared memory, then write the result to global memory with a normal store
atomicAdd directly to global memory
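
The two variants being compared can be sketched roughly as below. This is only an illustration under assumptions, not code from the original project: the histogram workload, the kernel names `histShared` and `histGlobal`, and the constants `kBins`, `kThreads`, and `kBlocks` are all hypothetical. The shared-memory variant stores one partial histogram per block with plain writes, so the host (or a second kernel) still has to sum the partials.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int kBins = 64;      // hypothetical histogram size
constexpr int kThreads = 256;  // threads per block (assumption)
constexpr int kBlocks = 32;    // blocks in the grid (assumption)

// Variant 1: atomicAdd to shared memory, then plain stores to global memory.
__global__ void histShared(const unsigned int* in, int n, unsigned int* perBlock) {
    __shared__ unsigned int local[kBins];
    for (int b = threadIdx.x; b < kBins; b += blockDim.x) local[b] = 0;
    __syncthreads();
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&local[in[i] % kBins], 1u);
    __syncthreads();
    // Each block owns its own slice of perBlock, so no atomics are needed here.
    for (int b = threadIdx.x; b < kBins; b += blockDim.x)
        perBlock[blockIdx.x * kBins + b] = local[b];
}

// Variant 2: atomicAdd directly to global memory.
__global__ void histGlobal(const unsigned int* in, int n, unsigned int* bins) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&bins[in[i] % kBins], 1u);
}

// Runs both variants on the same input and checks that the results agree.
bool variantsAgree() {
    const int n = 1 << 20;
    std::vector<unsigned int> h(n);
    for (int i = 0; i < n; ++i) h[i] = i * 2654435761u;  // arbitrary test data

    unsigned int *dIn, *dPerBlock, *dBins;
    cudaMalloc(&dIn, n * sizeof(unsigned int));
    cudaMalloc(&dPerBlock, kBlocks * kBins * sizeof(unsigned int));
    cudaMalloc(&dBins, kBins * sizeof(unsigned int));
    cudaMemcpy(dIn, h.data(), n * sizeof(unsigned int), cudaMemcpyHostToDevice);
    cudaMemset(dBins, 0, kBins * sizeof(unsigned int));

    histShared<<<kBlocks, kThreads>>>(dIn, n, dPerBlock);
    histGlobal<<<kBlocks, kThreads>>>(dIn, n, dBins);

    std::vector<unsigned int> partial(kBlocks * kBins), bins(kBins);
    cudaMemcpy(partial.data(), dPerBlock, partial.size() * sizeof(unsigned int),
               cudaMemcpyDeviceToHost);
    cudaMemcpy(bins.data(), dBins, bins.size() * sizeof(unsigned int),
               cudaMemcpyDeviceToHost);
    bool ok = (cudaGetLastError() == cudaSuccess);

    // Sum the per-block partials on the host and compare with variant 2.
    for (int b = 0; b < kBins; ++b) {
        unsigned int sum = 0;
        for (int blk = 0; blk < kBlocks; ++blk) sum += partial[blk * kBins + b];
        ok = ok && (sum == bins[b]);
    }
    cudaFree(dIn); cudaFree(dPerBlock); cudaFree(dBins);
    return ok;
}

int main() {
    printf("%s\n", variantsAgree() ? "results match" : "results differ");
    return 0;
}
```

Which variant is faster will depend on the number of bins, how often threads collide on the same address, and the GPU, so this sketch only shows the structure of the comparison, not its outcome.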

I noticed a lot of discussion on this topic, up to the Maxwell architecture. The conclusion up to then was: shared-memory atomicAdd is faster than global only when using int32. I am not sure whether anything changed in later architectures (maybe not, since I see no information in the Volta/Turing/Ampere tuning guides).

Thank you!!!

==========================================
Some puzzles: an early official post says atomicAdd to shared memory is faster, but another post reports that it is actually slower. And those are quite old.

Posts reporting that shared is slower:

Official tip claiming that shared atomicAdd is faster:

sz16 · August 22, 2024, 6:40pm

Wondering the same thing here. Did you eventually reach any conclusions? @202476410arsmart

Curefab · August 22, 2024, 8:55pm

You could just try it out. It makes a difference whether you also use the return value of atomicAdd.

Also distinguish between latency when your memory infrastructure is idle and when it is 100% occupied.
The maximum bandwidth can also be different.
Have all SMs access the same memory location to make sure that the atomic goes through L2 instead of L1 (if L1 is possible for atomics at all).
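
A minimal sketch of such a test, assuming a single global counter that every thread of every block hits (maximum contention) and CUDA events for timing. The kernel names `addNoReturn` and `addWithReturn`, the helper `timeMs`, and the launch sizes are made up for illustration. When the return value is discarded, the compiler is free to emit a reduction-style instruction rather than a full atomic, which is one reason the two kernels can time differently.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int kBlocks = 128;   // assumption: enough blocks to occupy all SMs
constexpr int kThreads = 256;
constexpr int kIters = 100;

// Return value of atomicAdd is discarded.
__global__ void addNoReturn(unsigned int* counter, int iters) {
    for (int i = 0; i < iters; ++i) atomicAdd(counter, 1u);
}

// Return value of atomicAdd is consumed and written out so it stays live.
__global__ void addWithReturn(unsigned int* counter, unsigned int* sink, int iters) {
    unsigned int acc = 0;
    for (int i = 0; i < iters; ++i) acc += atomicAdd(counter, 1u);
    sink[blockIdx.x * blockDim.x + threadIdx.x] = acc;
}

// Times one launch with CUDA events, after one untimed warm-up launch.
template <typename Launch>
float timeMs(Launch launch) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    launch();  // warm-up
    cudaEventRecord(start);
    launch();  // timed
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

// Runs both kernels, prints their times, and checks the final counter values.
bool timeBothVariants() {
    unsigned int *dCounter, *dSink;
    cudaMalloc(&dCounter, sizeof(unsigned int));
    cudaMalloc(&dSink, kBlocks * kThreads * sizeof(unsigned int));
    // Two launches per variant (warm-up + timed) each add blocks*threads*iters.
    const unsigned int expected = 2u * kBlocks * kThreads * kIters;
    unsigned int got = 0;

    cudaMemset(dCounter, 0, sizeof(unsigned int));
    float msNoRet = timeMs([&] {
        addNoReturn<<<kBlocks, kThreads>>>(dCounter, kIters);
    });
    cudaMemcpy(&got, dCounter, sizeof(got), cudaMemcpyDeviceToHost);
    bool ok = (got == expected);

    cudaMemset(dCounter, 0, sizeof(unsigned int));
    float msRet = timeMs([&] {
        addWithReturn<<<kBlocks, kThreads>>>(dCounter, dSink, kIters);
    });
    cudaMemcpy(&got, dCounter, sizeof(got), cudaMemcpyDeviceToHost);
    ok = ok && (got == expected) && (cudaGetLastError() == cudaSuccess);

    printf("return value unused: %.3f ms\n", msNoRet);
    printf("return value used:   %.3f ms\n", msRet);
    cudaFree(dCounter);
    cudaFree(dSink);
    return ok;
}

int main() {
    return timeBothVariants() ? 0 : 1;
}
```

To look at the idle versus fully loaded cases, the same kernels could be launched with a single warp first and then with the full grid; spreading the adds over many addresses instead of one counter would show the low-contention end of the range.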

202476410arsmart · August 23, 2024, 11:44am

Haha, I forgot about it. I gave up on that project and have moved on to others now.

