atomicAdd to shared memory vs. to global memory: which is faster? (for Turing or later)

Hi! I have gone through the different types of atomicAdd, and my algorithm is now stuck on this, so I am wondering which approach is recommended on recent architectures such as Volta, Turing, or Ampere.

The comparisons are:

  1. First atomicAdd to shared memory, then write the result to global memory normally

  2. atomicAdd directly to global memory
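To make the two options concrete, here is a minimal sketch using a byte histogram as a stand-in workload. The kernel names (`histShared`, `histGlobal`), the bin count, and the launch configuration are all assumptions for illustration, not anything from my real algorithm. Note that the flush in variant 1 still uses atomicAdd, because several blocks share the same output bins; a plain store would only be safe if each block owned its own output slice.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int kBins = 256;  // assumed bin count, small enough for shared memory

// Variant 1: accumulate per block in shared memory, then flush to global memory.
__global__ void histShared(const unsigned char* data, int n, unsigned int* bins) {
    __shared__ unsigned int local[kBins];
    for (int i = threadIdx.x; i < kBins; i += blockDim.x) local[i] = 0;
    __syncthreads();

    // Grid-stride loop: every element costs one shared-memory atomic.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&local[data[i]], 1u);
    __syncthreads();

    // Flush: at most one global atomic per bin per block.
    for (int i = threadIdx.x; i < kBins; i += blockDim.x)
        if (local[i] != 0) atomicAdd(&bins[i], local[i]);
}

// Variant 2: every element costs one global-memory atomic.
__global__ void histGlobal(const unsigned char* data, int n, unsigned int* bins) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&bins[data[i]], 1u);
}

int main() {
    const int n = 1 << 20;
    unsigned char* hData = new unsigned char[n];
    for (int i = 0; i < n; ++i) hData[i] = static_cast<unsigned char>(i % kBins);

    unsigned char* dData = nullptr;
    unsigned int *dA = nullptr, *dB = nullptr;
    cudaMalloc(&dData, n);
    cudaMalloc(&dA, kBins * sizeof(unsigned int));
    cudaMalloc(&dB, kBins * sizeof(unsigned int));
    cudaMemcpy(dData, hData, n, cudaMemcpyHostToDevice);
    cudaMemset(dA, 0, kBins * sizeof(unsigned int));
    cudaMemset(dB, 0, kBins * sizeof(unsigned int));

    histShared<<<64, 256>>>(dData, n, dA);
    histGlobal<<<64, 256>>>(dData, n, dB);

    // cudaMemcpy on the default stream waits for both kernels to finish.
    unsigned int hA[kBins], hB[kBins];
    cudaMemcpy(hA, dA, sizeof(hA), cudaMemcpyDeviceToHost);
    cudaMemcpy(hB, dB, sizeof(hB), cudaMemcpyDeviceToHost);

    bool same = true;
    for (int b = 0; b < kBins; ++b) same = same && (hA[b] == hB[b]);
    printf("results %s\n", same ? "match" : "differ");

    cudaFree(dData); cudaFree(dA); cudaFree(dB);
    delete[] hData;
    return 0;
}
```

Both kernels should produce the same bins; the question is only which one gets there faster on a given GPU and for a given access pattern.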

I have seen a lot of discussion on this topic, but only up to the Maxwell architecture. The conclusion back then was that shared-memory atomicAdd is faster than global only when int32 is used. I am not sure whether anything has changed in later architectures (maybe not; I see no information in the Volta/Turing/Ampere tuning guides).

Thank you!!!

==========================================
What puzzles me: an early official post says atomicAdd to shared memory is faster, but another post reports that it is actually slower. And both are quite old.

Posts reporting that shared is slower:

Official tip claiming that shared atomicAdd is faster:


I am wondering the same thing. Did you eventually reach any conclusions? @202476410arsmart


You could just try it out. It makes a difference whether you also use the return value of atomicAdd.

Also distinguish between the latency when your memory infrastructure is idle and the latency when it is 100% occupied.
The maximum bandwidth can differ as well.
Have all SMs access the same memory to make sure the access goes through L2 instead of L1 (if that is even possible for atomics).
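As a starting point for such a test, here is a rough timing sketch. The kernel names, the `timeMs` helper, and the launch sizes are made up for illustration. It has every thread on every SM hit a single global counter, once with the return value discarded and once with it consumed. The compiler may transform these loops (for example by aggregating the atomics), so it is worth inspecting the generated SASS before trusting any numbers.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Return value discarded: the compiler may emit a reduction that does not
// wait for the old value to come back.
__global__ void addNoReturn(unsigned int* counter, int iters) {
    for (int i = 0; i < iters; ++i) atomicAdd(counter, 1u);
}

// Return value consumed: the old value has to travel back to the thread.
__global__ void addWithReturn(unsigned int* counter, unsigned int* sink, int iters) {
    unsigned int acc = 0;
    for (int i = 0; i < iters; ++i) acc += atomicAdd(counter, 1u);
    sink[blockIdx.x * blockDim.x + threadIdx.x] = acc;  // keeps the result live
}

// Hypothetical helper: time one launch with CUDA events after a warm-up run.
template <typename Launch>
float timeMs(Launch launch) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    launch();  // warm-up
    cudaEventRecord(start);
    launch();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    int device = 0, numSMs = 0;
    cudaGetDevice(&device);
    cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, device);

    const int threads = 256;
    const int blocks = numSMs * 4;  // assumed: enough blocks to occupy every SM
    const int iters = 1000;

    unsigned int *counter = nullptr, *sink = nullptr;
    cudaMalloc(&counter, sizeof(unsigned int));
    cudaMalloc(&sink, blocks * threads * sizeof(unsigned int));
    cudaMemset(counter, 0, sizeof(unsigned int));

    float noRet = timeMs([&] { addNoReturn<<<blocks, threads>>>(counter, iters); });
    float withRet = timeMs([&] { addWithReturn<<<blocks, threads>>>(counter, sink, iters); });

    printf("global atomicAdd, return unused: %.3f ms\n", noRet);
    printf("global atomicAdd, return used:   %.3f ms\n", withRet);

    cudaFree(counter);
    cudaFree(sink);
    return 0;
}
```

This only covers the fully contended, single-address case. To compare against shared memory, or to measure the idle-latency case, you would add a shared-memory variant of each kernel and a run with a single thread.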


Haha, I forgot about it. I gave up on that project and have moved on to others now.
