How do atomic functions operate on shared memory?

I came across the following statements about atomic instructions in these papers:
“atomic instructions are handled by special integer units attached to the L2 cache controller, not by the integer units in the CUDA cores.”
“While G80 had atomic instructions, by allowing these atomic values to be placed in the shared L2 cache in Fermi, atomic instructions can be 5X to 20X faster.”
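According to these statements, atomic instructions are executed by units attached to the L2 cache controller.
Shared memory is on-chip and is not accessed through the L2 cache, so I wonder how atomicity is guaranteed when an atomic function operates on data in shared memory.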
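
To make the question concrete, here is a minimal sketch of the two cases I mean. The kernel name `count_threads` and the host helper `run_count` are hypothetical names made up for this illustration; only `atomicAdd` and the CUDA runtime calls are real API. The first `atomicAdd` targets a `__shared__` variable, the second targets global memory, which is the case the quoted statements seem to describe.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: every in-range thread increments a per-block counter
// held in shared memory, then one thread per block adds that partial count
// to a counter in global memory.
__global__ void count_threads(unsigned int *total, int n)
{
    __shared__ unsigned int block_count;   // lives in on-chip shared memory
    if (threadIdx.x == 0)
        block_count = 0;
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&block_count, 1u);       // atomic on a shared-memory address

    __syncthreads();
    if (threadIdx.x == 0)
        atomicAdd(total, block_count);     // atomic on a global-memory address
}

// Hypothetical host helper: launches the kernel for n > 0 elements and
// returns the final value of the global counter.
unsigned int run_count(int n)
{
    unsigned int *d_total = nullptr;
    unsigned int h_total = 0;
    cudaMalloc(&d_total, sizeof(unsigned int));
    cudaMemset(d_total, 0, sizeof(unsigned int));

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    count_threads<<<blocks, threads>>>(d_total, n);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(&h_total, d_total, sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(d_total);
    return h_total;
}
```

If both kinds of atomics behave as expected, `run_count(1000)` should return 1000. My question is about the first `atomicAdd`: which hardware performs it, given that `block_count` is not in the L2 cache?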