Background: atomic operations on global memory are performed in the L2 cache.
How many atomic cores does the L2 cache have, and can we query that number?
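As far as I can tell there is no attribute for the number of atomic units, only adjacent properties; a minimal query sketch (attribute names are the standard runtime ones):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int dev = 0;
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, dev);

    int l2Bytes = 0, smCount = 0;
    // L2 size and SM count are queryable; the number of parallel atomic
    // units in L2 is NOT exposed by any documented attribute (assumption:
    // it would have to be measured instead, see the microbenchmark below).
    cudaDeviceGetAttribute(&l2Bytes, cudaDevAttrL2CacheSize, dev);
    cudaDeviceGetAttribute(&smCount, cudaDevAttrMultiProcessorCount, dev);

    printf("%s: L2 = %d bytes, SMs = %d\n", prop.name, l2Bytes, smCount);
    return 0;
}
```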
With a GPU that has 3 SM units (GT 1030), merging local results into global memory atomically means at most 3 atomic operations contend per variable in algorithms such as histogram. But on a big GPU with 170 SM units (RTX 5090), the contention per variable grows from 3 to 170.
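For reference, this is the merge pattern I mean, as a minimal privatized-histogram sketch (256 bins assumed):

```cpp
#define NUM_BINS 256

// Each block accumulates a private histogram in shared memory, then merges
// it into the global histogram with one atomicAdd per bin. Contention on
// each global bin scales with the number of blocks merging concurrently,
// i.e. roughly with the SM count.
__global__ void histogram(const unsigned char* data, int n, unsigned int* gHist) {
    __shared__ unsigned int sHist[NUM_BINS];
    for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x)
        sHist[i] = 0;
    __syncthreads();

    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&sHist[data[i]], 1u);
    __syncthreads();

    // local -> global merge: this is where L2 atomic throughput matters
    for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x)
        atomicAdd(&gHist[i], sHist[i]);
}
```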
For repeated or larger local → global merges, this could become a scalability problem. Is there an official way to know how many atomicAdds the L2 cache can perform in parallel on different addresses, so that we can compute the point at which to switch to a two-pass version with a reduction kernel?
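Lacking an official query, the only fallback I can think of is measuring it: sweep atomicAdds over a growing number of distinct cache lines and look for where throughput stops scaling. A rough sketch (grid and iteration sizes are arbitrary):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Fan atomicAdds out over 'addrs' distinct 128-byte lines; time should
// drop as 'addrs' grows and flatten once it exceeds the number of atomics
// L2 can retire in parallel on different addresses.
__global__ void atomicSweep(unsigned int* buf, int addrs, int iters) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    for (int k = 0; k < iters; ++k)
        atomicAdd(&buf[((tid + k) % addrs) * 32], 1u);  // *32 uints = 128 B stride
}

int main() {
    const int maxAddrs = 4096;
    unsigned int* buf;
    cudaMalloc(&buf, maxAddrs * 32 * sizeof(unsigned int));
    atomicSweep<<<1024, 256>>>(buf, 1, 1);  // warm-up launch
    cudaDeviceSynchronize();
    for (int addrs = 1; addrs <= maxAddrs; addrs *= 2) {
        cudaEvent_t t0, t1;
        cudaEventCreate(&t0); cudaEventCreate(&t1);
        cudaEventRecord(t0);
        atomicSweep<<<1024, 256>>>(buf, addrs, 64);
        cudaEventRecord(t1);
        cudaEventSynchronize(t1);
        float ms = 0;
        cudaEventElapsedTime(&ms, t0, t1);
        printf("addrs=%5d  %8.3f ms\n", addrs, ms);
        cudaEventDestroy(t0); cudaEventDestroy(t1);
    }
    cudaFree(buf);
    return 0;
}
```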
For example, on a GPU with only 1 SM unit, if the L2 cache had 100 atomic cores, then a block size of 256 would be better than 128: 256 atomics complete in 3 steps (with 56 items in the last step) versus 2 steps for 128 (with 28 in the last). Since doubling the work from 128 to 256 items only raises the step count from 2 to 3, the larger block size gets more items done per step.
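The arithmetic I'm assuming, with a hypothetical lanes value for L2 atomic parallelism (100 here, as in the example): steps = ceil(blockSize / lanes), so average items per step = blockSize / steps:

```cpp
#include <cstdio>

// 'lanes' = hypothetical number of atomicAdds L2 retires in parallel on
// different addresses; not queryable today, assumed 100 for the example.
int stepsForBlock(int blockSize, int lanes) {
    return (blockSize + lanes - 1) / lanes;  // ceil division
}

int main() {
    const int lanes = 100;
    int sizes[] = {128, 256};
    for (int bs : sizes) {
        int steps = stepsForBlock(bs, lanes);
        printf("block=%d -> %d steps, %.1f items/step\n",
               bs, steps, (double)bs / steps);
    }
    // block=128 -> 2 steps, 64.0 items/step
    // block=256 -> 3 steps, 85.3 items/step  => larger block is more efficient
    return 0;
}
```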
Runtime APIs like the following take only occupancy into consideration:
cudaOccupancyMaxPotentialBlockSize(..)
but some algorithms spend time waiting on operations like atomicAdd without their occupancy decreasing, so the suggested block size may not be the right one.
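For reference, that occupancy-only suggestion looks like this; nothing in it accounts for atomic contention:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void mergeKernel(unsigned int* gHist) {
    atomicAdd(gHist, 1u);  // stand-in for an atomic-heavy merge
}

int main() {
    int minGridSize = 0, blockSize = 0;
    // Suggests the block size maximizing theoretical occupancy; stalls
    // caused by atomicAdd serialization are invisible to this heuristic.
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize,
                                       mergeKernel, 0, 0);
    printf("suggested block size: %d\n", blockSize);
    return 0;
}
```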
Would having more types of operations in the L2 cache be useful for database acceleration? When joining two tables, there is a lot of data to process, but only the matching rows (those with the same index) actually need to move. Something similar to atomicAdd, but like a hypothetical atomicSelectRowAndJoinTable, maybe?
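Today the closest pattern I know of reserves output slots with atomicAdd on a global counter during the probe phase; a hypothetical atomicSelectRowAndJoinTable would fold the comparison into that L2-side operation. A sketch of the current pattern (the table layout is deliberately simplified):

```cpp
#include <cuda_runtime.h>

// Probe side of a join: each thread checks one left-table row against a
// prebuilt lookup, simplified here to a direct-mapped key -> rightIdx array
// where -1 means no match. Only matching rows reserve an output slot.
__global__ void probeJoin(const int* leftKeys, int nLeft,
                          const int* rightIndexByKey,  // simplified hash table
                          int2* outPairs, unsigned int* outCount) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nLeft) return;

    int rightIdx = rightIndexByKey[leftKeys[i]];
    if (rightIdx >= 0) {
        // atomicAdd hands out a unique output position; this is the op a
        // hypothetical L2-side join primitive would have to combine with
        // the comparison itself.
        unsigned int slot = atomicAdd(outCount, 1u);
        outPairs[slot] = make_int2(i, rightIdx);
    }
}
```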
Kepler's global atomics were substantially slower than Maxwell's. Did Kepler perform atomics inside the SM unit (taking a global lock) rather than in L2 for global variables?
What happens if we enable CUDA compressible memory for a buffer and then use atomics on it? Does each atomicAdd encode/decode the data per operation, or does it decode once, increment many times, then encode once?
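For context, a compressible buffer has to be allocated through the driver API; a sketch of that flow without error checking, assuming the device reports CU_DEVICE_ATTRIBUTE_GENERIC_COMPRESSION_SUPPORTED, to have something to benchmark atomics against:

```cpp
#include <cuda.h>

// Reserve, create, and map a compressible allocation (driver API flow;
// all calls should be error-checked in real code).
CUdeviceptr allocCompressible(size_t size, CUdevice dev) {
    CUmemAllocationProp prop = {};
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = dev;
    prop.allocFlags.compressionType = CU_MEM_ALLOCATION_COMP_GENERIC;

    size_t gran = 0;
    cuMemGetAllocationGranularity(&gran, &prop, CU_MEM_ALLOC_GRANULARITY_MINIMUM);
    size = ((size + gran - 1) / gran) * gran;  // round up to granularity

    CUdeviceptr ptr;
    cuMemAddressReserve(&ptr, size, 0, 0, 0);

    CUmemGenericAllocationHandle handle;
    cuMemCreate(&handle, size, &prop, 0);
    cuMemMap(ptr, size, 0, handle, 0);

    CUmemAccessDesc access = {};
    access.location = prop.location;
    access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
    cuMemSetAccess(ptr, size, &access, 1);
    return ptr;  // usable with atomics like any other device pointer
}
```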
About atomicAdd parallelism in L2:
If we could query its capabilities, we could apply block specialization: use atomicAdd directly on global memory to merge local results in some CUDA blocks, and write a per-block output in the remaining CUDA blocks for a later reduction. The total time would then be minimized by balancing the load between the L2 atomic cores and the available bandwidth (used by the reduction kernel).
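As a sketch, reusing sHist from the histogram above, only the merge stage changes; atomicBlocks is the hypothetical knob a capability query would feed:

```cpp
#define NUM_BINS 256  // as in the histogram sketch

// Block specialization: blocks below 'atomicBlocks' merge straight into
// gHist through L2 atomics; the rest store contention-free partial rows
// that a second reduction kernel sums (rows atomicBlocks..gridDim.x-1),
// so L2 atomic cores and DRAM bandwidth are loaded at the same time.
__device__ void mergeSpecialized(const unsigned int* sHist,
                                 unsigned int* gHist,
                                 unsigned int* partials,  // gridDim.x rows of NUM_BINS
                                 int atomicBlocks) {
    if (blockIdx.x < atomicBlocks) {
        for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x)
            atomicAdd(&gHist[i], sHist[i]);
    } else {
        for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x)
            partials[blockIdx.x * NUM_BINS + i] = sHist[i];
    }
}
```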