How are atomics implemented on a hardware level?

I have heard that every memory location in DRAM has a queue into which atomic operations are pushed one after the other.
How is this queue implemented, and what is its size? Is there an upper limit on the number of atomic operations that can be executed in parallel at any point in time (since the queue would be finite)? What happens when the queue's limit is exceeded?
Thank you.

I don’t think you’ll find any detailed published information describing these things.

I'm not aware of per-memory-location queues such as you describe; I've never heard of that. However, most activities on the GPU do go through a queue or pipeline of some sort. Atomics are handled initially by the LD/ST unit in the SM, and global atomics are ultimately "resolved" (i.e. finished, completed) in the L2 cache.

There is no limit to the usage of atomics in CUDA device code. Nothing bad (e.g. illegal activity or incorrect results) happens if you use an arbitrarily large number of atomic operations in your device code.

If you study the Nsight Compute profiler documentation carefully, you may get some additional understanding of these things. If the LD/ST unit gets "overloaded", it will throttle activity. Beyond that I don't know of any detailed information.

At an extremely simplistic level, an instruction on a GPU will not be issued if the necessary conditions for its proper handling are not in place. Such an instruction (more precisely, its warp) is "stalled", and the profiler (Nsight Compute) can report various information about stalls.

