Atomic Operations and Clock Cycles

I was wondering if there was a table somewhere that lists the performance in operations/clock cycle of various CUDA functions. For instance, I saw a table in 5.4.1 of the latest CUDA guide but it’s only for arithmetic operations. I’m trying to determine the speed of atomicAdd.

Because atomicAdd() calls to the same address have to be serialized in the memory controller, the speed of atomicAdd() depends on how likely it is that multiple threads try to update the same global memory location at the same time. Even with no collisions there is a minimum cost, which is at least the latency of a global memory read or write. If you need to know the exact performance, you should try writing a small test kernel that mocks up your memory access pattern.
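A test kernel of that kind could look roughly like the sketch below. The names atomicAddKernel, timeAtomicAdd and numTargets are made up for illustration, and the "tid % numTargets" indexing is only a stand-in for whatever access pattern your real code has. Setting numTargets to 1 makes every thread hit the same address (worst case), while setting it to the total thread count or more gives each thread its own address (no collisions). Error checking is kept minimal.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical test kernel: each thread performs one atomicAdd.
// numTargets controls contention (1 = all threads collide).
__global__ void atomicAddKernel(unsigned int *counters, unsigned int numTargets)
{
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    atomicAdd(&counters[tid % numTargets], 1u);
}

// Hypothetical host helper: times one launch with CUDA events.
// Returns the elapsed time in milliseconds, or -1.0f if an allocation
// or the kernel fails. Writes the sum of all counters to *total so the
// caller can check that every atomicAdd was applied.
float timeAtomicAdd(unsigned int numTargets, unsigned int blocks,
                    unsigned int threadsPerBlock, unsigned long long *total)
{
    unsigned int *dCounters = nullptr;
    size_t bytes = numTargets * sizeof(unsigned int);
    if (cudaMalloc(&dCounters, bytes) != cudaSuccess) return -1.0f;

    // Warm-up launch so one-time startup cost is not included in the timing.
    cudaMemset(dCounters, 0, bytes);
    atomicAddKernel<<<blocks, threadsPerBlock>>>(dCounters, numTargets);
    cudaDeviceSynchronize();

    // Reset the counters and time a second launch.
    cudaMemset(dCounters, 0, bytes);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    atomicAddKernel<<<blocks, threadsPerBlock>>>(dCounters, numTargets);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    bool ok = (cudaGetLastError() == cudaSuccess);

    // Copy the counters back and add them up on the host.
    std::vector<unsigned int> hCounters(numTargets);
    cudaMemcpy(hCounters.data(), dCounters, bytes, cudaMemcpyDeviceToHost);
    unsigned long long sum = 0;
    for (unsigned int v : hCounters) sum += v;
    *total = sum;

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dCounters);
    return ok ? ms : -1.0f;
}
```

Calling timeAtomicAdd once with numTargets = 1 and once with numTargets = blocks * threadsPerBlock, and comparing the two times, gives a rough idea of how much collisions cost on your particular GPU. A single launch is a noisy measurement, so in practice you would probably average over several runs.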