Atomic operation unit?

I wonder how atomic operations are carried out.
I assume they are handled by a dedicated hardware unit.
If so, where does that unit sit in the GPU, and how does it differ between GT200 and Fermi?

Yes, it does appear that special hardware is used, since the original compute capability 1.0 devices could not perform atomic operations. Compute capability 1.1 devices can perform 32-bit atomic operations on global memory, 1.2 adds shared-memory atomics and 64-bit atomics on global memory, and finally 2.0 (Fermi) adds atomic addition on single-precision floating-point values.
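To make the global-memory case concrete, here is a minimal sketch that compares an atomic counter with a plain read-modify-write. The kernel names and the `run_count` helper are made up for illustration, and error checking is left out to keep it short.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Every thread adds 1 to one counter in global memory.
// 32-bit integer atomicAdd on global memory needs compute capability 1.1 or later.
__global__ void count_atomic(unsigned int *counter)
{
    atomicAdd(counter, 1u);
}

// Same update as a plain read-modify-write: threads can overwrite each other.
__global__ void count_racy(unsigned int *counter)
{
    *counter = *counter + 1u;
}

// Hypothetical helper: runs one of the kernels and returns the final count.
unsigned int run_count(bool use_atomic, int blocks, int threads)
{
    unsigned int *d_counter = NULL;
    unsigned int h_counter = 0;
    cudaMalloc(&d_counter, sizeof(unsigned int));
    cudaMemset(d_counter, 0, sizeof(unsigned int));
    if (use_atomic)
        count_atomic<<<blocks, threads>>>(d_counter);
    else
        count_racy<<<blocks, threads>>>(d_counter);
    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(&h_counter, d_counter, sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(d_counter);
    return h_counter;
}

int main()
{
    const int blocks = 64, threads = 256;
    printf("expected: %d\n", blocks * threads);
    printf("atomic:   %u\n", run_count(true, blocks, threads));
    printf("racy:     %u\n", run_count(false, blocks, threads));
    return 0;
}
```

The atomic version should report exactly `blocks * threads`. The racy version usually reports far less, and the exact number can change from run to run.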

My suspicion (although this is not confirmed by any official statement from NVIDIA) is that the atomic logic for global memory was placed in the memory controller, since that would be the easiest way to ensure atomic safety across all multiprocessors. This would also explain why only a limited set of operations (exchange, compare-and-swap, addition/subtraction, increment/decrement, min, max, and bitwise and/or/xor) is available.
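One consequence of the small built-in set is that anything else has to be built from compare-and-swap. As a sketch (not NVIDIA's implementation), here is a float addition built on the 32-bit `atomicCAS`, following the same pattern the CUDA programming guide uses for double precision. The names `atomic_add_float_cas`, `sum_ones` and `run_float_sum` are hypothetical.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Float addition built from a 32-bit compare-and-swap loop.
// This is an illustrative helper, not a CUDA built-in.
__device__ float atomic_add_float_cas(float *address, float value)
{
    int *address_as_int = (int *)address;
    int old = *address_as_int;
    int assumed;
    do {
        assumed = old;
        // Try to swap in (assumed + value); atomicCAS returns what was actually there.
        old = atomicCAS(address_as_int, assumed,
                        __float_as_int(__int_as_float(assumed) + value));
        // Compare as integers so a NaN in memory cannot loop forever.
    } while (assumed != old);
    return __int_as_float(old);
}

// Every thread adds 1.0f to the same accumulator.
__global__ void sum_ones(float *sum)
{
    atomic_add_float_cas(sum, 1.0f);
}

// Hypothetical helper: launches the kernel and returns the accumulated sum.
float run_float_sum(int blocks, int threads)
{
    float *d_sum = NULL;
    float h_sum = 0.0f;
    cudaMalloc(&d_sum, sizeof(float));
    cudaMemset(d_sum, 0, sizeof(float)); // all-zero bits are 0.0f
    sum_ones<<<blocks, threads>>>(d_sum);
    cudaMemcpy(&h_sum, d_sum, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_sum);
    return h_sum;
}

int main()
{
    // 64 * 256 = 16384 additions of 1.0f, all exactly representable in a float.
    printf("sum = %.1f\n", run_float_sum(64, 256));
    return 0;
}
```

The loop retries whenever another thread changed the value between the read and the swap, so heavy contention on one address means many retries.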

In compute capability 2.0 devices, atomic operations on global memory are considerably faster than on earlier devices because they are serviced by the L2 cache rather than by global memory (DRAM) directly.

Thank you for your reply.

I found some information about atomic instructions in these papers.

“atomic instructions are handled by special integer units attached to the L2 cache controller, not by the integer units in the CUDA cores.”

“While G80 had atomic instructions, by allowing these atomic values to be placed in the shared L2 cache in Fermi, atomic instructions can be 5X to 20X faster.”

According to these statements, I think the atomic units are located at the L2 cache controller.

But I still wonder how atomicity is guaranteed for data held in the L1 cache.

The atomic units at the L2 cache controller wouldn’t be able to control accesses to the L1 cache.

And one more thing: where is the L2 cache controller located?

Is it inside the multiprocessor or outside it?

I am pretty sure atomic operations have to bypass the L1 cache in order to maintain coherency across the chip. (We know this is possible since PTX 2.0 includes modifiers to perform global reads and writes that bypass the L1 and/or L2 cache.)

Again, no hard information, but it would only make sense for it to be outside the multiprocessor.
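For reference, here is a rough sketch of what such an L1-bypassing read can look like from CUDA C, using inline PTX with the `.cg` cache operator (cache in L2, not in L1). The helper names are made up. It assumes a 64-bit build (the `"l"` constraint passes a 64-bit pointer) and that the pointer refers to global memory. It only shows the load modifier and says nothing about how the atomic units themselves work.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Load a 32-bit value with the PTX .cg cache operator (cached in L2, not in L1).
// Assumes a 64-bit build and a pointer into global memory.
__device__ unsigned int load_bypass_l1(const unsigned int *ptr)
{
    unsigned int value;
    asm volatile("ld.global.cg.u32 %0, [%1];" : "=r"(value) : "l"(ptr) : "memory");
    return value;
}

// Copy in[] to out[], reading every element through the L1-bypassing load.
__global__ void copy_via_l2(const unsigned int *in, unsigned int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = load_bypass_l1(in + i);
}

// Hypothetical helper: returns how many copied elements differ from the input.
int run_copy_check()
{
    const int n = 1024;
    static unsigned int h_in[n], h_out[n];
    for (int i = 0; i < n; ++i)
        h_in[i] = 3u * i + 1u;

    unsigned int *d_in = NULL, *d_out = NULL;
    cudaMalloc(&d_in, n * sizeof(unsigned int));
    cudaMalloc(&d_out, n * sizeof(unsigned int));
    cudaMemcpy(d_in, h_in, n * sizeof(unsigned int), cudaMemcpyHostToDevice);
    copy_via_l2<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaMemcpy(h_out, d_out, n * sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);

    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (h_out[i] != h_in[i])
            ++mismatches;
    return mismatches;
}

int main()
{
    printf("mismatches: %d\n", run_copy_check());
    return 0;
}
```

The result is the same as with an ordinary load. The only difference is which cache level services the read, so it needs a compute capability 2.0 target (for example `-arch=sm_20` on toolkits of that era).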

Thank you.

Your reply was very helpful.