atomicAdd to shared memory vs. to global memory: which is faster? (for Turing or later)

Hi! I have gone through the different types of atomicAdd, and my algorithm is now stuck on this, so I am wondering what the current state of things is for the latest architectures, such as Volta, Turing, or Ampere.

The two approaches I am comparing are:

  1. First atomicAdd to shared memory, then write the result to global memory with a normal store

  2. atomicAdd directly to global memory
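To make the comparison concrete, here is a minimal sketch of both approaches as a byte histogram. Everything in it is an assumption for illustration, not taken from any official sample: the names `histGlobal`, `histShared`, and `runComparison`, the constants `NUM_BINS`, `BLOCKS`, and `THREADS`, and the choice to have each block write its own partial result so that a normal store is enough. Error checking is left out to keep it short, and the timing is only a rough single-run measurement, so treat any numbers it prints as a starting point rather than a conclusion.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical sizes, chosen only for illustration.
constexpr int NUM_BINS = 256;
constexpr int BLOCKS   = 64;
constexpr int THREADS  = 256;

// Approach 2: every thread issues its atomicAdd straight to global memory.
__global__ void histGlobal(const unsigned char *data, int n, unsigned int *bins)
{
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        atomicAdd(&bins[data[i]], 1u);
}

// Approach 1: atomicAdd into shared memory, then a normal store to global.
// Each block owns its own slice of `partial`, so no global atomic is needed.
__global__ void histShared(const unsigned char *data, int n, unsigned int *partial)
{
    __shared__ unsigned int local[NUM_BINS];
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        local[b] = 0;
    __syncthreads();

    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        atomicAdd(&local[data[i]], 1u);
    __syncthreads();

    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        partial[blockIdx.x * NUM_BINS + b] = local[b];
}

// Runs both kernels on the same input, prints rough timings, and returns
// true if the two histograms agree.
bool runComparison(int n)
{
    std::vector<unsigned char> h(n);
    for (int i = 0; i < n; ++i)
        h[i] = static_cast<unsigned char>((i * 31) % NUM_BINS);

    unsigned char *d;
    unsigned int *g, *p;
    cudaMalloc(&d, n);
    cudaMalloc(&g, NUM_BINS * sizeof(unsigned int));
    cudaMalloc(&p, BLOCKS * NUM_BINS * sizeof(unsigned int));
    cudaMemcpy(d, h.data(), n, cudaMemcpyHostToDevice);
    cudaMemset(g, 0, NUM_BINS * sizeof(unsigned int));

    cudaEvent_t t0, t1, t2;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);
    cudaEventCreate(&t2);

    // Single run each, no warm-up: a real benchmark would repeat and average.
    cudaEventRecord(t0);
    histGlobal<<<BLOCKS, THREADS>>>(d, n, g);
    cudaEventRecord(t1);
    histShared<<<BLOCKS, THREADS>>>(d, n, p);
    cudaEventRecord(t2);
    cudaEventSynchronize(t2);

    float msGlobal = 0.0f, msShared = 0.0f;
    cudaEventElapsedTime(&msGlobal, t0, t1);
    cudaEventElapsedTime(&msShared, t1, t2);
    printf("global atomics: %.3f ms, shared atomics: %.3f ms\n", msGlobal, msShared);

    std::vector<unsigned int> hg(NUM_BINS), hp(BLOCKS * NUM_BINS);
    cudaMemcpy(hg.data(), g, NUM_BINS * sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaMemcpy(hp.data(), p, BLOCKS * NUM_BINS * sizeof(unsigned int), cudaMemcpyDeviceToHost);

    // Sum the per-block partial histograms on the host and compare.
    bool ok = true;
    for (int b = 0; b < NUM_BINS; ++b) {
        unsigned int sum = 0;
        for (int k = 0; k < BLOCKS; ++k)
            sum += hp[k * NUM_BINS + b];
        ok = ok && (sum == hg[b]);
    }

    cudaEventDestroy(t0);
    cudaEventDestroy(t1);
    cudaEventDestroy(t2);
    cudaFree(d);
    cudaFree(g);
    cudaFree(p);
    return ok;
}

int main()
{
    bool ok = runComparison(1 << 24);
    printf("histograms match: %s\n", ok ? "yes" : "no");
    return ok ? 0 : 1;
}
```

If several blocks have to update the same output location, the final normal store in `histShared` would have to become a global atomicAdd, which changes the comparison.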

I have seen a lot of discussion on this topic, but only up to the Maxwell architecture. The conclusion back then was that shared-memory atomicAdd is faster than global only when int32 is used. I am not sure whether anything has changed in later architectures (maybe not, since I see no information in the Volta/Turing/Ampere tuning guides…).

Thank you!!!

What puzzles me: an early official post says atomicAdd to shared memory is faster, but another post reports that it is actually slower. And those are quite old…

Posts reporting that shared atomicAdd is slower:

Official tip claiming that shared atomicAdd is faster