For both Fermi and Kepler, what is the maximum number of 32-bit atomic operations (on global memory) that can be performed truly concurrently? I mean under ideal conditions, where all of the global memory addresses are unique.
You mean when all atomicAdd calls are done on separate memory locations? Would atomics even be required in such a situation? An atomic operation is most efficient when only a relatively small number of threads access the same memory location. This page may be of help.
Yes, I do mean that all atomic calls are being done on separate locations; this would give me an idea of what the best possible performance could be. I am considering using atomics in a situation where I have to update a lot of global memory locations, with a small but non-zero probability of collisions. Every thread would update one location. The page you linked concerns reductions, where a small number of threads update the same memory location, which is not the same situation.
I am told that devices have a limited number of “atomic units”, the hardware that performs the operations. If I knew that number, I would know how many operations can be performed concurrently.
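For concreteness, the access pattern described above (every thread updates one global location, with occasional collisions) could be sketched as follows. This is only a minimal illustration: the names scatterAdd and runScatterAdd, the index array that picks each thread's target, and the block size of 256 are assumptions made for the example, not something from this thread. Error checking is left out to keep it short.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Each thread adds 1 to the counter selected by its own entry in `index`.
// atomicAdd keeps the result correct even when two threads happen to
// select the same counter (the rare collision case).
__global__ void scatterAdd(unsigned int *counters,
                           const unsigned int *index, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        atomicAdd(&counters[index[tid]], 1u);
    }
}

// Hypothetical host helper: copies the indices to the device, runs the
// kernel with one thread per index, and returns the final counters.
std::vector<unsigned int> runScatterAdd(const std::vector<unsigned int> &index,
                                        int numCounters)
{
    std::vector<unsigned int> result(numCounters, 0u);
    int n = static_cast<int>(index.size());
    if (n == 0) return result;  // nothing to launch

    unsigned int *dCounters = nullptr;
    unsigned int *dIndex = nullptr;
    cudaMalloc(&dCounters, numCounters * sizeof(unsigned int));
    cudaMalloc(&dIndex, n * sizeof(unsigned int));
    cudaMemset(dCounters, 0, numCounters * sizeof(unsigned int));
    cudaMemcpy(dIndex, index.data(), n * sizeof(unsigned int),
               cudaMemcpyHostToDevice);

    int threads = 256;                         // assumed block size
    int blocks = (n + threads - 1) / threads;  // round up
    scatterAdd<<<blocks, threads>>>(dCounters, dIndex, n);

    // This copy waits for the kernel to finish before reading the results.
    cudaMemcpy(result.data(), dCounters, numCounters * sizeof(unsigned int),
               cudaMemcpyDeviceToHost);
    cudaFree(dCounters);
    cudaFree(dIndex);
    return result;
}
```

If the indices are all unique, the atomics never conflict and a plain store would give the same result; the atomic version only matters for the occasional collision.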
I accidentally stumbled on the answer to this question in the GTX 680 Whitepaper:
Page 12 says that GF110 can do 24 atomic operations to independent addresses per clock (which clock??), whereas GK104 can do 64 per clock.
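Since it is not obvious which clock the whitepaper figure refers to, one way to sanity-check it on a given card is to time a kernel where every thread does a single atomicAdd to its own address. The sketch below is an assumption-laden example, not a rigorous benchmark: the names uniqueAtomics and measureUniqueAtomicRate and the block size are made up for illustration, and the measured time also includes launch overhead and memory traffic, so the result is a rough lower bound on the atomic rate rather than the hardware peak.

```cuda
#include <cuda_runtime.h>

// Every thread performs one atomicAdd on its own, unique address,
// so there are no collisions between threads.
__global__ void uniqueAtomics(unsigned int *counters, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        atomicAdd(&counters[tid], 1u);
    }
}

// Hypothetical helper: returns the observed atomic operations per second
// for n unique addresses (n should be large, e.g. a few million).
double measureUniqueAtomicRate(int n)
{
    unsigned int *dCounters = nullptr;
    cudaMalloc(&dCounters, n * sizeof(unsigned int));
    cudaMemset(dCounters, 0, n * sizeof(unsigned int));

    int threads = 256;                         // assumed block size
    int blocks = (n + threads - 1) / threads;  // round up

    // Warm-up launch so one-time startup costs are not timed.
    uniqueAtomics<<<blocks, threads>>>(dCounters, n);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    uniqueAtomics<<<blocks, threads>>>(dCounters, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until the timed kernel has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dCounters);

    return static_cast<double>(n) / (ms * 1e-3);
}
```

Dividing the returned rate by a candidate clock frequency (core, shader, or memory) gives operations per clock, which can then be compared against the 24 and 64 quoted above.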
Thanks, that’s exactly what I was looking for!