Is there a way to query the number of dedicated atomicAdd cores inside L2 cache?

Source for L2 atomic operations support: L2 cache.

How many cores are there in L2, or can we query the number?

On a GPU with 3 SM units (GT1030), merging local results into global memory atomically requires at most 3 atomic operations per variable in algorithms such as histograms. But on a big GPU with 170 SM units (RTX 5090), the contention grows from 3 to 170.

For repeated or larger local → global merges, this could become a scalability problem. Is there an official way to know whether the L2 cache can perform N atomicAdds in parallel on different addresses, so that we can compute the point at which to switch to a two-pass version with a reduction kernel?

For example, on a GPU with only 1 SM unit, if the L2 cache had 100 atomic cores, then a block size of 256 would be better than 128 (3 rounds of atomics with 56 items in the last round vs. 2 rounds with 28 in the last). Since 256 items in 3 rounds merge more items per round than 128 items in 2 rounds, increasing the block size looks better.

Runtime APIs like the following only take occupancy into consideration:

cudaOccupancyMaxPotentialBlockSize(..)

but some algorithms spend their time waiting on operations such as atomicAdd, which does not decrease occupancy.
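
For reference, this is roughly how I call it today (a minimal sketch with a placeholder histogramKernel; the suggested value is based purely on occupancy):

```
#include <cuda_runtime.h>

// Placeholder kernel for illustration: each thread adds one input value into a bin.
__global__ void histogramKernel(const int* data, int n, unsigned int* bins)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(&bins[data[i]], 1u);
}

void pickBlockSize()
{
    int minGridSize = 0; // smallest grid size that can still reach full occupancy
    int blockSize   = 0; // block size suggested purely from occupancy
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, histogramKernel, 0, 0);
    // blockSize maximizes occupancy, but it says nothing about how many
    // concurrent atomicAdds the L2 can actually service.
}
```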

Would having more types of operations in the L2 cache be useful for database acceleration? If we're joining two tables, there is a lot of data to be processed, but only the resulting rows (those with the same table index) actually need to move. Something similar to atomicAdd, but more like an atomicSelectRowAndJoinTable, maybe?

Kepler global atomics were substantially slower than Maxwell's. Did Kepler perform atomics inside the SM unit (locking it globally) instead of in L2 for global variables?

What happens if we enable CUDA compressible memory for a buffer and then use atomics on it? Does atomicAdd encode/decode the data on every atomic operation, or does it decode once, increment many times, then encode once?


About atomicAdd parallelism in L2:

If we could query its capabilities, we could apply block specialization: use atomicAdd directly on global memory to merge local results in some of the CUDA blocks, and write a per-block output in the remaining CUDA blocks for a later reduction, so that the total time is minimized by balancing the load between the L2 atomic cores and the available bandwidth (used by the reduction kernel).
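
As a rough illustration of what I mean (a sketch only; the split point atomicBlocks and all names are hypothetical):

```
// Illustrative sketch of block specialization. Blocks below the split merge their
// shared-memory bins into the global histogram with atomicAdd; the remaining blocks
// store private partials that a separate reduction kernel sums afterwards.
__global__ void histogramSpecialized(const int* data, int n,
                                     unsigned int* globalBins,    // [numBins]
                                     unsigned int* perBlockBins,  // [(gridDim.x - atomicBlocks) * numBins]
                                     int numBins, int atomicBlocks)
{
    extern __shared__ unsigned int sBins[];   // privatized bins, sized numBins at launch
    for (int b = threadIdx.x; b < numBins; b += blockDim.x) sBins[b] = 0;
    __syncthreads();

    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridDim.x * blockDim.x)
        atomicAdd(&sBins[data[i]], 1u);       // cheap shared-memory atomics
    __syncthreads();

    if (blockIdx.x < atomicBlocks) {
        // Path A: merge directly into global memory (loads the L2 atomic units).
        for (int b = threadIdx.x; b < numBins; b += blockDim.x)
            atomicAdd(&globalBins[b], sBins[b]);
    } else {
        // Path B: plain stores into a private slice (costs bandwidth instead),
        // to be summed later by a reduction kernel.
        unsigned int* myBins = perBlockBins + (size_t)(blockIdx.x - atomicBlocks) * numBins;
        for (int b = threadIdx.x; b < numBins; b += blockDim.x)
            myBins[b] = sBins[b];
    }
}
```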

No API exists, to my knowledge, to query the throughput of L2 atomics (by data type). Moreover, the L2 is physically partitioned into L2 slices. Each slice is the point of coherence for a set of physical addresses. There is also no API, to my knowledge, for configuring the number of L2 slices or the stride between slices (which is >= 1 cache line on all GPUs). In order to achieve high atomic throughput, atomics have to be evenly distributed across slices, which means a fairly large address range. If you were to histogram into a single address, the throughput would be at most 1 atomic update per cycle and L2 slice utilization would be extremely unbalanced.
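
As a sketch of what "evenly distributed across slices" implies in practice, one option is to replicate the counters so the atomic traffic covers a wider address range and then fold the copies afterwards (NUM_COPIES is an illustrative tuning knob, not a documented value):

```
// Sketch: NUM_COPIES replicated counter arrays so that atomics from different
// warps hit different addresses (and, with a fine interleave, different L2 slices).
#define NUM_COPIES 8

__global__ void histogramReplicated(const int* data, int n,
                                    unsigned int* bins,   // [NUM_COPIES * numBins]
                                    int numBins)
{
    int warp = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize;
    unsigned int* myCopy = bins + (warp % NUM_COPIES) * numBins;  // scatter warps over copies

    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridDim.x * blockDim.x)
        atomicAdd(&myCopy[data[i]], 1u);
    // A small second pass then folds the NUM_COPIES copies into one final histogram.
}
```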

Due to the ratio of SMs to L2 slices (usually 2-3:1 on consumer GPUs and 1.5-2:1 on 100-class GPUs), it is useful to do multi-level reductions to avoid contention.
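
A minimal sketch of the last level of such a multi-level reduction, where per-block (or per-copy) partials are summed with plain loads and stores instead of global atomics (names are illustrative):

```
// One thread per bin sums numPartials partial histograms; no atomics at this level.
__global__ void reducePartials(const unsigned int* partials,   // [numPartials * numBins]
                               unsigned int* finalBins, int numBins, int numPartials)
{
    int b = blockIdx.x * blockDim.x + threadIdx.x;
    if (b >= numBins) return;

    unsigned int sum = 0;
    for (int p = 0; p < numPartials; ++p)
        sum += partials[(size_t)p * numBins + b];
    finalBins[b] = sum;
}
```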


Thanks. But each L2 slice sits in front of a memory controller and its own DRAM unit, right?

Are DRAM units mapped to memory addresses in an interleaved manner? If yes, then I think it is safe to say that all atomic units can be used by a single chunk of a histogram, such as an array of 100 elements (if there are 100 L2 slices / 100 memory controllers).

I remember the GTX 970 had one or two memory controllers that were not interleaved with the rest. Would that cause more issues in a histogram algorithm without multi-level reductions?

For discrete GPUs, L2 slices connect to one memory controller. There are often N slices per memory channel/pseudo-channel. For integrated GPUs, L2 slices are backed by an interconnect.

DRAM units are generally mapped in an interleaved manner. The mapping is not documented and may change per architecture/chip.

An L2 cache line is 128 bytes. 100 consecutive 4-byte elements (400 bytes, spanning about 4 cache lines) would be in 1-4 slices depending on the interleaving stride.


Last question:

Then should we use padding to match the 128-byte L2 cache line size when the atomic operations are used for locking/unlocking (such as some blocks locking neighboring items), to avoid false sharing? (Is there even false sharing between blocks there?)
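
If padding is the answer, I imagine something like this minimal sketch (assuming alignas(128) and a simple spin-lock flag; the struct and function names are made up):

```
// One lock per 128-byte L2 line so that blocks spinning on neighboring locks
// never touch the same cache line.
struct alignas(128) PaddedLock {
    unsigned int flag;   // 0 = unlocked, 1 = locked (rest of the line is padding)
};

__device__ bool tryLock(PaddedLock* lock)
{
    return atomicCAS(&lock->flag, 0u, 1u) == 0u;   // old value 0 means we acquired it
}

__device__ void unlock(PaddedLock* lock)
{
    __threadfence();                // make protected writes visible first
    atomicExch(&lock->flag, 0u);    // release
}
```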

I guess histograms wouldn't benefit much from padding if they're too large (like 1M bins -> padded becomes the footprint of 32M elements, which may not fit inside the L2 cache and would cause evictions to device memory).
