Background: atomic operations on global memory are performed in the L2 cache.
How many atomic cores does the L2 cache have, and can we query that number?
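As far as I can tell there is no attribute for the number of atomic units, only adjacent properties; a minimal query sketch (attribute names are the standard runtime ones):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int dev = 0;
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, dev);

    int l2Bytes = 0, smCount = 0;
    // L2 size and SM count are queryable; the number of parallel atomic
    // units in L2 is NOT exposed by any documented attribute (assumption:
    // it would have to be measured instead, see the microbenchmark below).
    cudaDeviceGetAttribute(&l2Bytes, cudaDevAttrL2CacheSize, dev);
    cudaDeviceGetAttribute(&smCount, cudaDevAttrMultiProcessorCount, dev);

    printf("%s: L2 = %d bytes, SMs = %d\n", prop.name, l2Bytes, smCount);
    return 0;
}
```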
With a GPU that has 3 SM units (GT 1030), merging local results into global memory atomically means at most 3 atomic operations contend per variable in algorithms such as histogram. But on a big GPU with 170 SM units (RTX 5090), the contention per variable grows from 3 to 170.
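For reference, this is the merge pattern I mean, as a minimal privatized-histogram sketch (256 bins assumed):

```cpp
#define NUM_BINS 256

// Each block accumulates a private histogram in shared memory, then merges
// it into the global histogram with one atomicAdd per bin. Contention on
// each global bin scales with the number of blocks merging concurrently,
// i.e. roughly with the SM count.
__global__ void histogram(const unsigned char* data, int n, unsigned int* gHist) {
    __shared__ unsigned int sHist[NUM_BINS];
    for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x)
        sHist[i] = 0;
    __syncthreads();

    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&sHist[data[i]], 1u);
    __syncthreads();

    // local -> global merge: this is where L2 atomic throughput matters
    for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x)
        atomicAdd(&gHist[i], sHist[i]);
}
```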
For repeated or larger local → global merges, this could become a scalability problem. Is there an official way to know how many atomicAdds the L2 cache can perform in parallel on different addresses, so that we can compute the point at which to switch to a two-pass version with a reduction kernel?
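Lacking an official query, the only fallback I can think of is measuring it: sweep atomicAdds over a growing number of distinct cache lines and look for where throughput stops scaling. A rough sketch (grid and iteration sizes are arbitrary):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Fan atomicAdds out over 'addrs' distinct 128-byte lines; time should
// drop as 'addrs' grows and flatten once it exceeds the number of atomics
// L2 can retire in parallel on different addresses.
__global__ void atomicSweep(unsigned int* buf, int addrs, int iters) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    for (int k = 0; k < iters; ++k)
        atomicAdd(&buf[((tid + k) % addrs) * 32], 1u);  // *32 uints = 128 B stride
}

int main() {
    const int maxAddrs = 4096;
    unsigned int* buf;
    cudaMalloc(&buf, maxAddrs * 32 * sizeof(unsigned int));
    atomicSweep<<<1024, 256>>>(buf, 1, 1);  // warm-up launch
    cudaDeviceSynchronize();
    for (int addrs = 1; addrs <= maxAddrs; addrs *= 2) {
        cudaEvent_t t0, t1;
        cudaEventCreate(&t0); cudaEventCreate(&t1);
        cudaEventRecord(t0);
        atomicSweep<<<1024, 256>>>(buf, addrs, 64);
        cudaEventRecord(t1);
        cudaEventSynchronize(t1);
        float ms = 0;
        cudaEventElapsedTime(&ms, t0, t1);
        printf("addrs=%5d  %8.3f ms\n", addrs, ms);
        cudaEventDestroy(t0); cudaEventDestroy(t1);
    }
    cudaFree(buf);
    return 0;
}
```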
For example, on a GPU with only 1 SM unit, if the L2 cache had 100 atomic cores, then a block size of 256 would be better than 128: 256 atomics complete in 3 steps (with 56 items in the last step) versus 2 steps for 128 (with 28 in the last). Since doubling the work from 128 to 256 items only raises the step count from 2 to 3, the larger block size gets more items done per step.
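The arithmetic I'm assuming, with a hypothetical lanes value for L2 atomic parallelism (100 here, as in the example): steps = ceil(blockSize / lanes), so average items per step = blockSize / steps:

```cpp
#include <cstdio>

// 'lanes' = hypothetical number of atomicAdds L2 retires in parallel on
// different addresses; not queryable today, assumed 100 for the example.
int stepsForBlock(int blockSize, int lanes) {
    return (blockSize + lanes - 1) / lanes;  // ceil division
}

int main() {
    const int lanes = 100;
    int sizes[] = {128, 256};
    for (int bs : sizes) {
        int steps = stepsForBlock(bs, lanes);
        printf("block=%d -> %d steps, %.1f items/step\n",
               bs, steps, (double)bs / steps);
    }
    // block=128 -> 2 steps, 64.0 items/step
    // block=256 -> 3 steps, 85.3 items/step  => larger block is more efficient
    return 0;
}
```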
Runtime APIs like the following take only occupancy into consideration:
cudaOccupancyMaxPotentialBlockSize(..)
but some algorithms spend time waiting on operations like atomicAdd without their occupancy decreasing, so the suggested block size may not be the right one.
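For reference, that occupancy-only suggestion looks like this; nothing in it accounts for atomic contention:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void mergeKernel(unsigned int* gHist) {
    atomicAdd(gHist, 1u);  // stand-in for an atomic-heavy merge
}

int main() {
    int minGridSize = 0, blockSize = 0;
    // Suggests the block size maximizing theoretical occupancy; stalls
    // caused by atomicAdd serialization are invisible to this heuristic.
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize,
                                       mergeKernel, 0, 0);
    printf("suggested block size: %d\n", blockSize);
    return 0;
}
```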
Would having more types of operations in the L2 cache be useful for database acceleration? When joining two tables, there is a lot of data to process, but only the matching rows (those with the same index) actually need to move. Something similar to atomicAdd, but like a hypothetical atomicSelectRowAndJoinTable, maybe?
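Today the closest pattern I know of reserves output slots with atomicAdd on a global counter during the probe phase; a hypothetical atomicSelectRowAndJoinTable would fold the comparison into that L2-side operation. A sketch of the current pattern (the table layout is deliberately simplified):

```cpp
#include <cuda_runtime.h>

// Probe side of a join: each thread checks one left-table row against a
// prebuilt lookup, simplified here to a direct-mapped key -> rightIdx array
// where -1 means no match. Only matching rows reserve an output slot.
__global__ void probeJoin(const int* leftKeys, int nLeft,
                          const int* rightIndexByKey,  // simplified hash table
                          int2* outPairs, unsigned int* outCount) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nLeft) return;

    int rightIdx = rightIndexByKey[leftKeys[i]];
    if (rightIdx >= 0) {
        // atomicAdd hands out a unique output position; this is the op a
        // hypothetical L2-side join primitive would have to combine with
        // the comparison itself.
        unsigned int slot = atomicAdd(outCount, 1u);
        outPairs[slot] = make_int2(i, rightIdx);
    }
}
```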
Kepler's global atomics were substantially slower than Maxwell's. Did Kepler perform atomics inside the SM unit (taking a global lock) rather than in L2 for global variables?
What happens if we enable CUDA compressible memory for a buffer and then use atomics on it? Does each atomicAdd encode/decode the data per operation, or does it decode once, increment many times, then encode once?
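For context, a compressible buffer has to be allocated through the driver API; a sketch of that flow without error checking, assuming the device reports CU_DEVICE_ATTRIBUTE_GENERIC_COMPRESSION_SUPPORTED, to have something to benchmark atomics against:

```cpp
#include <cuda.h>

// Reserve, create, and map a compressible allocation (driver API flow;
// all calls should be error-checked in real code).
CUdeviceptr allocCompressible(size_t size, CUdevice dev) {
    CUmemAllocationProp prop = {};
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = dev;
    prop.allocFlags.compressionType = CU_MEM_ALLOCATION_COMP_GENERIC;

    size_t gran = 0;
    cuMemGetAllocationGranularity(&gran, &prop, CU_MEM_ALLOC_GRANULARITY_MINIMUM);
    size = ((size + gran - 1) / gran) * gran;  // round up to granularity

    CUdeviceptr ptr;
    cuMemAddressReserve(&ptr, size, 0, 0, 0);

    CUmemGenericAllocationHandle handle;
    cuMemCreate(&handle, size, &prop, 0);
    cuMemMap(ptr, size, 0, handle, 0);

    CUmemAccessDesc access = {};
    access.location = prop.location;
    access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
    cuMemSetAccess(ptr, size, &access, 1);
    return ptr;  // usable with atomics like any other device pointer
}
```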
About atomicAdd parallelism in L2:
If we could query its capabilities, we could apply block specialization: use atomicAdd directly on global memory to merge local results in some CUDA blocks, and write a per-block output in the remaining CUDA blocks for a later reduction. The total time would then be minimized by balancing the load between the L2 atomic cores and the available bandwidth (used by the reduction kernel).
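As a sketch, reusing sHist from the histogram above, only the merge stage changes; atomicBlocks is the hypothetical knob a capability query would feed:

```cpp
#define NUM_BINS 256  // as in the histogram sketch

// Block specialization: blocks below 'atomicBlocks' merge straight into
// gHist through L2 atomics; the rest store contention-free partial rows
// that a second reduction kernel sums (rows atomicBlocks..gridDim.x-1),
// so L2 atomic cores and DRAM bandwidth are loaded at the same time.
__device__ void mergeSpecialized(const unsigned int* sHist,
                                 unsigned int* gHist,
                                 unsigned int* partials,  // gridDim.x rows of NUM_BINS
                                 int atomicBlocks) {
    if (blockIdx.x < atomicBlocks) {
        for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x)
            atomicAdd(&gHist[i], sHist[i]);
    } else {
        for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x)
            partials[blockIdx.x * NUM_BINS + i] = sHist[i];
    }
}
```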