How are atomic operations implemented in CUDA?

I managed to reproduce my UFO observation, but only when using shared memory, which of course was the last thing I tried.

Using global memory for the victim variable behaved as you said it should in every case I tried: single blocks, multiple blocks, multiple concurrent kernels, etc.

I still don’t understand why the code in the post below doesn’t work with a volatile write.
It works fine if I substitute atomicExch() for the volatile write.
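
To make the comparison concrete, here is a minimal sketch of the two variants side by side. It is not the code from the post below: the kernel `reset_race`, the host helper `run_reset`, and the variable name `victim` are hypothetical, and the access pattern (every thread doing atomicAdd() on a shared variable while thread 0 resets it without a barrier) is only an assumption about the kind of test involved.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel, for illustration only.
// Every thread increments a shared "victim" counter with atomicAdd().
// Thread 0 then resets it to 0, with no barrier between the increments
// and the reset, using either atomicExch() or a plain volatile store.
__global__ void reset_race(unsigned int *out, bool use_atomic_exch)
{
    __shared__ unsigned int victim;

    if (threadIdx.x == 0) victim = 0;
    __syncthreads();                      // everyone sees the initial 0

    atomicAdd(&victim, 1u);               // concurrent atomic updates

    if (threadIdx.x == 0) {
        if (use_atomic_exch) {
            atomicExch(&victim, 0u);      // reset through the atomic path
        } else {
            // reset through an ordinary (volatile) store instead
            *((volatile unsigned int *)&victim) = 0u;
        }
    }

    __syncthreads();                      // all updates and the reset are done
    if (threadIdx.x == 0) *out = victim;  // report the final value
}

// Hypothetical host helper: launches one block and returns the final value.
// Error checking is omitted to keep the sketch short.
unsigned int run_reset(bool use_atomic_exch, int threads = 256)
{
    unsigned int *d_out = nullptr;
    unsigned int h_out = 0;
    cudaMalloc(&d_out, sizeof(unsigned int));
    reset_race<<<1, threads>>>(d_out, use_atomic_exch);
    cudaMemcpy(&h_out, d_out, sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return h_out;
}
```

With atomicExch(), the reset is ordered against the atomicAdd() calls, so the final value can only count increments that landed after the reset and must be less than the thread count. The sketch makes no claim about what the volatile variant returns on any particular GPU; calling `run_reset(false)` and `run_reset(true)` repeatedly and comparing the results is the point of the experiment.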

Given King_Crimson’s reply, I don’t think this is a bug as far as C++ goes, but I still find it an interesting phenomenon.