How efficient are atomic operations?

How many cycles does an atomic operation like atomicCAS() take? For example, if two threads execute atomicCAS() on the same address in global memory at the same time, how long does it take for both operations to complete? And what if there are more threads? Will the total time grow linearly or exponentially?

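Simultaneous accesses to the same address are serialized, so I would expect the total time to grow linearly, not exponentially, with the number of colliding threads. The cost per operation depends on the GPU architecture, though, so as usual the only way to get an accurate answer is to measure it on your own hardware.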
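One way to measure it is a small microbenchmark along these lines. This is only a sketch under my own assumptions: the names cas_kernel and time_cas, the launch configuration (64 blocks of 256 threads), and the iteration count are made up for illustration. Each thread issues a single atomicCAS() per iteration with no retry loop, so the kernel time reflects the cost of the operations themselves, and the number of distinct addresses controls how many threads collide on each one.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s failed: %s\n", #call,                 \
                    cudaGetErrorString(err_));                        \
            exit(1);                                                  \
        }                                                             \
    } while (0)

// Hypothetical benchmark kernel: thread tid targets slots[tid % num_slots],
// so fewer slots means more threads colliding on each address.
__global__ void cas_kernel(unsigned int *slots, int num_slots, int iters)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int *addr = &slots[tid % num_slots];
    for (int i = 0; i < iters; ++i) {
        // One attempt per iteration, no retry: the swap succeeds only if
        // *addr == i, but the atomic is issued either way.
        atomicCAS(addr, (unsigned int)i, (unsigned int)i + 1u);
    }
}

// Runs the kernel once and returns the elapsed time in milliseconds.
float time_cas(int blocks, int threads_per_block, int num_slots, int iters)
{
    unsigned int *slots = nullptr;
    CHECK(cudaMalloc(&slots, num_slots * sizeof(unsigned int)));
    CHECK(cudaMemset(slots, 0, num_slots * sizeof(unsigned int)));

    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    CHECK(cudaEventRecord(start));
    cas_kernel<<<blocks, threads_per_block>>>(slots, num_slots, iters);
    CHECK(cudaEventRecord(stop));
    CHECK(cudaEventSynchronize(stop));
    CHECK(cudaGetLastError());

    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));

    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    CHECK(cudaFree(slots));
    return ms;
}

int main()
{
    const int blocks = 64, threads_per_block = 256, iters = 1000;
    const int total = blocks * threads_per_block;  // 16384 threads

    // Warm-up launch so one-time startup cost is not included in the timings.
    time_cas(blocks, threads_per_block, total, iters);

    // Sweep from "every thread on one address" to "one address per thread".
    for (int num_slots = 1; num_slots <= total; num_slots *= 4) {
        float ms = time_cas(blocks, threads_per_block, num_slots, iters);
        printf("%6d threads per address: %8.3f ms\n", total / num_slots, ms);
    }
    return 0;
}
```

Compile with something like nvcc -O2 and plot the time against threads per address. If the serialization argument holds on your GPU, the curve should look roughly linear once contention dominates. Dividing the time by the number of operations gives an approximate per-operation cost, which you can convert to cycles using the device clock rate.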