Hi! I have gone through the different ways of using atomicAdd, and my algorithm is now stuck on this, so I am wondering what the current recommendation is for the latest architectures, such as Volta, Turing, or Ampere?
The two options I am comparing are:
First atomicAdd to shared memory, then write the result to global memory normally
atomicAdd directly to global memory
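For concreteness, here is a minimal sketch of the two variants I mean, written as a histogram-style accumulation. The kernel names, `NUM_BINS`, the launch configuration, and the `runHistogram` helper are all made up for illustration and are not my actual code. Error checking is left out to keep it short. Note that with several blocks updating the same bins, the final flush in the shared-memory variant still needs a global atomicAdd; a plain store would only work if each block owned its own output slice.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

constexpr int NUM_BINS = 256;  // hypothetical bin count, one bin per byte value

// Variant 1: atomicAdd into shared memory, then flush each block's partial result to global memory.
__global__ void histShared(const unsigned char* data, int n, int* bins) {
    __shared__ int local[NUM_BINS];
    for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x) local[i] = 0;
    __syncthreads();

    int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        atomicAdd(&local[data[i]], 1);  // int32 shared-memory atomic
    __syncthreads();

    // At most one global atomic per bin per block (blocks share the output bins).
    for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x)
        if (local[i] != 0) atomicAdd(&bins[i], local[i]);
}

// Variant 2: atomicAdd straight into global memory for every element.
__global__ void histGlobal(const unsigned char* data, int n, int* bins) {
    int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        atomicAdd(&bins[data[i]], 1);
}

// Runs one variant, returns the histogram, and stores the kernel time in *ms.
std::vector<int> runHistogram(const std::vector<unsigned char>& input, bool useShared, float* ms) {
    int n = static_cast<int>(input.size());
    unsigned char* dData = nullptr;
    int* dBins = nullptr;
    cudaMalloc(&dData, n);
    cudaMalloc(&dBins, NUM_BINS * sizeof(int));
    cudaMemcpy(dData, input.data(), n, cudaMemcpyHostToDevice);
    cudaMemset(dBins, 0, NUM_BINS * sizeof(int));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    if (useShared) histShared<<<64, 256>>>(dData, n, dBins);
    else           histGlobal<<<64, 256>>>(dData, n, dBins);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(ms, start, stop);

    std::vector<int> bins(NUM_BINS);
    cudaMemcpy(bins.data(), dBins, NUM_BINS * sizeof(int), cudaMemcpyDeviceToHost);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dData);
    cudaFree(dBins);
    return bins;
}

int main() {
    std::vector<unsigned char> input(1 << 24);
    for (auto& v : input) v = static_cast<unsigned char>(rand() % NUM_BINS);

    float msShared = 0.0f, msGlobal = 0.0f;
    std::vector<int> a = runHistogram(input, true, &msShared);
    std::vector<int> b = runHistogram(input, false, &msGlobal);

    printf("results match: %s\n", a == b ? "yes" : "no");
    printf("shared: %.3f ms, global: %.3f ms\n", msShared, msGlobal);
    return a == b ? 0 : 1;
}
```

A single timed launch like this includes warm-up cost, so for a real comparison I would run each variant several times and also try a skewed input (many elements hitting the same bin), since contention is likely what decides which variant wins.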
I have seen a lot of discussion on this topic, but only up to the Maxwell architecture. The conclusion back then was that shared-memory atomicAdd is faster than global atomicAdd only when using int32. I am not sure whether anything has changed in later architectures (maybe not, since I see no information about it in the Volta/Turing/Ampere tuning guides).
What puzzles me: an early official post says that atomicAdd to shared memory is faster, but another post reports that it is actually slower. And both are quite old…
Posts reporting that shared atomicAdd is slower:
Official tip claiming that shared atomicAdd is faster