How are atomics implemented on a hardware level?

I have heard that every memory location in DRAM has a queue into which atomic operations are pushed one after the other.
How is this queue implemented, and what is its size? Is there an upper limit on the number of atomic operations that can be executed in parallel at any point in time (since the queue would be finite)? What happens when the queue's limit is exceeded?
Thank you.

I don’t think you’ll find any detailed published information describing these things.

I'm not aware of per-memory-location queues such as you describe; I've never heard of that. However, most activities on the GPU do go through a queue or pipeline of some sort. Atomics are handled initially by the LD/ST unit in the SM, and global atomics are ultimately "resolved" (i.e. finished, completed) in the L2 cache.

There is no limit to the usage of atomics in CUDA device code. Nothing bad (e.g. illegal activity or incorrect results) happens if you use an arbitrarily large number of atomic operations in your device code.

If you study the Nsight Compute profiler documentation carefully, you may get some additional understanding of these things. If the LD/ST unit gets "overloaded", it will throttle activity. Beyond that I don't know of any detailed information.

At an extremely simplistic level, an instruction on a GPU will not be issued if the necessary conditions for its proper handling are not in place. Such an instruction (more precisely, its warp) is "stalled", and the profiler (Nsight Compute) can report various information about stalls.

