Hi! I have gone through the different ways of using atomicAdd, and my algorithm is now stuck on this, so I am wondering what the current recommendation is for the latest architectures, such as Volta, Turing, or Ampere?
The comparisons are:
First atomicAdd to shared memory, then write the result to global memory normally
atomicAdd directly to global memory
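For concreteness, here is a minimal sketch of the two variants, assuming a histogram-style accumulation. The kernel names, NUM_BINS, and the input pattern are hypothetical and only for illustration. Note that in this sketch the final flush from shared to global memory still uses atomicAdd, because several blocks share the same bins; a plain store would only be enough if each block owned its own output.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int NUM_BINS = 256;  // hypothetical bin count, one bin per byte value

// Variant 1: atomicAdd into shared memory, then flush once per block to global memory.
__global__ void histShared(const unsigned char* in, int n, unsigned int* bins) {
    __shared__ unsigned int local[NUM_BINS];
    for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x) local[i] = 0;
    __syncthreads();
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridDim.x * blockDim.x)
        atomicAdd(&local[in[i]], 1u);            // shared-memory atomic
    __syncthreads();
    for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x)
        if (local[i]) atomicAdd(&bins[i], local[i]);  // one global atomic per bin per block
}

// Variant 2: atomicAdd directly into global memory.
__global__ void histGlobal(const unsigned char* in, int n, unsigned int* bins) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridDim.x * blockDim.x)
        atomicAdd(&bins[in[i]], 1u);             // global-memory atomic
}

int main() {
    const int n = 1 << 20;
    unsigned char* h_in = new unsigned char[n];
    for (int i = 0; i < n; ++i) h_in[i] = (unsigned char)((i * 31) & 0xFF);

    unsigned char* d_in;
    unsigned int *d_a, *d_b;
    cudaMalloc(&d_in, n);
    cudaMalloc(&d_a, NUM_BINS * sizeof(unsigned int));
    cudaMalloc(&d_b, NUM_BINS * sizeof(unsigned int));
    cudaMemcpy(d_in, h_in, n, cudaMemcpyHostToDevice);
    cudaMemset(d_a, 0, NUM_BINS * sizeof(unsigned int));
    cudaMemset(d_b, 0, NUM_BINS * sizeof(unsigned int));

    histShared<<<64, 256>>>(d_in, n, d_a);
    histGlobal<<<64, 256>>>(d_in, n, d_b);
    cudaDeviceSynchronize();

    unsigned int h_a[NUM_BINS], h_b[NUM_BINS];
    cudaMemcpy(h_a, d_a, sizeof(h_a), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_b, d_b, sizeof(h_b), cudaMemcpyDeviceToHost);

    // Integer atomics are exact, so both variants should produce identical bins.
    bool same = true;
    for (int i = 0; i < NUM_BINS; ++i) same = same && (h_a[i] == h_b[i]);
    printf("results %s\n", same ? "match" : "differ");

    cudaFree(d_in); cudaFree(d_a); cudaFree(d_b);
    delete[] h_in;
    return same ? 0 : 1;
}
```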
I noticed a lot of discussion on this topic, but only up to the Maxwell architecture. The conclusion back then was that shared-memory atomicAdd is faster than global atomicAdd only when using int32. I am not sure whether anything has changed in later architectures (maybe not; I see no information in the Volta/Turing/Ampere tuning guides).
Thank you!!!
==========================================
What puzzles me: an early official post says that atomics to shared memory are faster, but another post reports that they are actually slow. And both are quite old…
You could just try it out. It makes a difference whether you also use the return value of atomicAdd.
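To illustrate the return-value point, here is a minimal sketch with two hypothetical kernels: one only counts and ignores the result of atomicAdd, the other needs the old value to pick an output slot. When the result is unused, the compiler is typically free to emit a fire-and-forget reduction instead of a fetch-and-add that has to wait for the old value; whether that actually happens for a given build is an assumption worth checking in the generated SASS.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Return value ignored: the thread does not need to wait for the old value.
__global__ void countOnly(const int* in, int n, unsigned int* counter) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && in[i] > 0) atomicAdd(counter, 1u);
}

// Return value used: the old value selects the output slot (simple compaction).
__global__ void compact(const int* in, int n, int* out, unsigned int* counter) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && in[i] > 0) {
        unsigned int slot = atomicAdd(counter, 1u);
        out[slot] = in[i];
    }
}

int main() {
    const int n = 1 << 20;
    int* h_in = new int[n];
    unsigned int expected = 0;
    for (int i = 0; i < n; ++i) {
        h_in[i] = (i % 3) - 1;          // values -1, 0, 1
        if (h_in[i] > 0) ++expected;
    }

    int *d_in, *d_out;
    unsigned int* d_cnt;                // d_cnt[0] for countOnly, d_cnt[1] for compact
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMalloc(&d_cnt, 2 * sizeof(unsigned int));
    cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_cnt, 0, 2 * sizeof(unsigned int));

    const int threads = 256, blocks = (n + threads - 1) / threads;
    countOnly<<<blocks, threads>>>(d_in, n, d_cnt);
    compact<<<blocks, threads>>>(d_in, n, d_out, d_cnt + 1);
    cudaDeviceSynchronize();

    unsigned int h_cnt[2];
    cudaMemcpy(h_cnt, d_cnt, sizeof(h_cnt), cudaMemcpyDeviceToHost);
    bool ok = (h_cnt[0] == expected) && (h_cnt[1] == expected);
    printf("counters %s\n", ok ? "agree with host" : "disagree with host");

    cudaFree(d_in); cudaFree(d_out); cudaFree(d_cnt);
    delete[] h_in;
    return ok ? 0 : 1;
}
```

Timing each kernel separately (for example with CUDA events) would show whether the difference matters on a particular GPU.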
Also distinguish between the latency when your memory infrastructure is idle and the latency when it is 100% occupied.
The maximum bandwidth can also differ between the two cases.
Have all SMs access the same memory to make sure that the access goes through L2 instead of L1 (if going through L1 is even possible for atomics).
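A rough sketch of such a measurement, under stated assumptions: the kernel name hammer, the buffer size, the iteration count, and the launch shapes are all hypothetical choices. A single small block approximates the mostly idle case, many blocks on one address approximate the fully contended case where all SMs hit the same word, and many blocks on spread-out addresses give a throughput-oriented comparison.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread issues `iters` global atomics. With mask == 0 every thread on
// every SM hits the same word; with a larger mask the addresses are spread out.
__global__ void hammer(unsigned int* data, unsigned int mask, int iters) {
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    for (int k = 0; k < iters; ++k)
        atomicAdd(&data[(tid + k) & mask], 1u);
}

// Time one launch with CUDA events, after an untimed warm-up launch.
float timeMs(int blocks, int threads, unsigned int* d, unsigned int mask, int iters) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    hammer<<<blocks, threads>>>(d, mask, iters);   // warm-up
    cudaEventRecord(start);
    hammer<<<blocks, threads>>>(d, mask, iters);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    const int numSMs = prop.multiProcessorCount;

    const unsigned int words = 1u << 20;           // power of two so (words - 1) is a mask
    unsigned int* d;
    cudaMalloc(&d, words * sizeof(unsigned int));
    cudaMemset(d, 0, words * sizeof(unsigned int));

    const int iters = 1000, threads = 256;
    float idle   = timeMs(1, 32, d, 0u, iters);                       // one warp, one address
    float hot    = timeMs(4 * numSMs, threads, d, 0u, iters);         // all SMs, one address
    float spread = timeMs(4 * numSMs, threads, d, words - 1, iters);  // all SMs, many addresses

    printf("1 warp, same address:     %.3f ms\n", idle);
    printf("all SMs, same address:    %.3f ms\n", hot);
    printf("all SMs, spread addresses: %.3f ms\n", spread);

    cudaFree(d);
    return 0;
}
```

The absolute numbers depend heavily on the GPU, so only the relative comparison on the same device is meaningful.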