How can we achieve maximum L2 cache throughput on an RTX 3080 GPU?

I am optimizing a program in which each SM runs one block. The pseudocode is as follows::

//Located in gmem. Each element is a 512-byte block. T can be tuned to an //appropriate value in order to achieve optimal L2 throughput performance.
B[8][T];
__shared__ A[64KB]
load A from gmem
//warp id in block
warp_id = threadIdx.x / 32
for i in [0..T]:
    data_b <- B[warp_id][i]
    data_a <- A[XX]
    calculate data op data_a

As shown in the pseudocode, each block reads the same B array data, while different warps within the same block read different portions of the B array. Each block reads the B array only once. Since the B array is relatively small, it can be fully cached in L2. The A data, on the other hand, is entirely cached in shared memory. Given this memory access pattern, how should the B array be arranged in memory in order to achieve maximum L2 throughput? I estimate that the actual bandwidth achievable from the L2 cache will determine the upper bound of the overall program performance.

The reason I believe that the data layout of B determines the achievable upper bound of L2 throughput is that I understand that on the RTX 3080 GPU I am currently using, there are a total of 10 L2 partitions, and each cache line is stored in one of these partitions. Each partition also contains internal sub-partitions. To achieve optimal performance, data accesses must be evenly distributed across the sub-partitions of all L2 partitions. However, I do not know how addresses are mapped to these sub-partitions.

All of the above understanding is based on responses generated by AI; please point out any inaccuracies if present.

GPU: RTX 3080

Additionally, I have another question: is there any difference in the achievable L2 bandwidth when using the cp.async.cg instruction to copy data from L2 to shared memory compared to directly using ld instructions to access L2(Assume no cache hit on L1)?

Robert and Greg know the details about the internal architecture in more detail than me.

You can use Nsight Compute to measure the current cache hit rates and achieved throughput compared to the maximum possible.

If you want to get maximum L2 throughput, you should read sectors of 32 bytes or probably slightly better (as they share tags) full cache lines of 128 bytes at the same time. The L1 cache will compensate to a certain degree for non-optimal accesses.
[corrected the sizes according to Nanodeoclus’ mesasge]

If each element is a 512 bytes block, then this line
data_b <- B[warp_id][i]
for a single i reads 512 bytes?

How is it distributed on the threads of the warp? Each threads reads 128 bits (e.g. uint4 or float4)?

I would guess that your program will exhibit nearly optimal performance from the start and that there will be little you can do to improve L2 performance more (talking about small single-digit percentages perhaps).

As long as your LD/ST (Load Store) units in the SM are not fully occupied.
cp.async.cg probably won’t give an improvement.

Different approaches to improve your program could be

  • to lower the number of or size of necessary L2 accesses, e.g. by compression, or a different algorithm like fusing several iteration steps of the algorithm
  • to use a Blackwell class device for distributed shared memory
  • to try to keep your data in the combined L1s or shared memory of the SMs - 68 * 128 KiB L1/SM = 8.5 MiB; however, L1 and shared memory gets erased between kernel calls

Thank you for your reply.

B[warp_id][i] actually corresponds to the A-matrix tile used by the mma.sync.aligned.m16n8k32.row.col.s32.s8.s8.s32 instruction on the RTX 3080 Tensor Cores. At the moment, I store it in global memory using the layout required by the MMA instruction for the A matrix.

Each time, I use the following instruction to copy the data from global memory to shared memory:

cp.async.cg.shared.global [%0], [%1], 16

In order to measure the upper bound of the achievable L2 bandwidth, I asked an AI assistant (Claude Code + Opus 4.8) to write a microbenchmark. The measured result was only around 1.1 TB/s.

I am not sure whether this is because I am not using cp.async correctly, because the L2 partitions are not being utilized evenly, or because both factors are contributing to the result.

Could you please explain what you mean by the above statement?

My current understanding is that both ld instructions and cp.async instructions are dispatched to the MIO instruction queue and wait there to be processed. Based on what you said, is it correct to understand that ld instructions are handled by the conventional LD/ST units, while cp.async is processed by a different hardware unit?

I asked an AI assistant to benchmark the L2 bandwidth under two different memory access patterns. The pseudocode is as follows:

constants:

GROUP, NG                 # commit every GROUP copies; keep NG groups in flight

SLOTS = GROUP \* NG        # 16B copies issued per thread per iteration

REGION = 5 MB             # global region == L2 size, L2-resident after warmup

each thread:

smem_addr\[SLOTS\]          # per-warp CONTIGUOUS shared staging (same in both)

gmem_addr\[SLOTS\]          # <-- THE ONLY DIFFERENCE  (sections 1 vs 2)

repeat iters times:

    for grp in 0..NG-1:

        for c in 0..GROUP-1:

            cp.async.cg  smem_addr\[grp\*GROUP + c\] <- gmem_addr\[grp\*GROUP + c\]   # 16B, bypass L1 -> L2

        cp.async.commit_group

        cp.async.wait_group(NG-1)         # keep (NG-1)\*GROUP copies outstanding

    \# addresses are re-read every iteration (volatile asm => not optimized away)

cp.async.wait_group(0)

A warp’s 32 lanes always read `lane*16` => one **contiguous, 512B-aligned**

block per (warp, slot). Coalescing and alignment are therefore the same in both

kernels. Only the *spatial layout of the SLOTS blocks* differs.
bandwidth = blocks * threads * iters * SLOTS * 16 / elapsed

1. Pattern (A) — scattered → ~1.1 TB/s

gmem_addr(warp w, lane L, slot s) =

        g + ( w*512 + L*16 + s*(total*16) )  mod  REGION

                              ~~~~~~~~~~~~

                              huge per-slot stride:  total*16 = gridDim*blockDim*16 ~= 557 KB


per warp : SLOTS blocks, each ~557 KB apart  -> fanned out across the whole 5 MB

Global 5 MB, drawn as 512B blocks:

 0        ~0.5M     ~1.1M     ~1.7M             ~4.5M

 |---------|---------|---------|---- ... -------|

 warp 0 : [s0]      [s1]      [s2]    ...       [s7]

 warp 1 :  [s0]      [s1]      [s2]    ...        [s7]      (+512B)

 warp 2 :   [s0]      [s1] ...

           \____________ at slot 0 every warp is clustered here ____________/


A warp’s 8 accesses fan out over ~4.5 MB; at a given slot index all warps sit

in one narrow window that marches across the region slot by slot.

## 2. Pattern (B) — contiguous (GEMM A-column / B-row style) → ~1.62 TB/s

bpw = floor(REGION / num_warps / 512)          # contiguous 512B blocks per warp (e.g. 9)

gmem_addr(global warp W, lane L, slot s) =

        g + W*(bpw*512) + L*16 + (s mod bpw)*512

            ~~~~~~~~~~~

            each warp owns a PRIVATE, CONTIGUOUS, NON-OVERLAPPING region

per warp : bpw contiguous 512B blocks (~4.6 KB); slots cycle inside that region

Global 5 MB, partitioned into per-warp contiguous regions (~4.6 KB each):

 0       4.6K     9.2K    13.8K                      5M

 |--------|--------|--------|----- ... ------|--------|

 warp 0 : [b0..b8]

 warp 1 :          [b0..b8]

 warp 2 :                   [b0..b8]

   ...                                                [b0..b8]  warp (N-1)

Each warp stays inside its own tight region; the warps tile the full 5 MB.

At any instant the warps are spread across all of L2 => every slice stays busy.

The complete benchmarking code generated by the AI:

l2_read_pattern_compare.txt (11.8 KB)

One more question:

How can the theoretical peak bandwidth of the L2 cache be estimated? For example, can we derive it by knowing the interface width and operating frequency of each L2 partition on the SM-facing side and then calculating the aggregate bandwidth ourselves?

For the optimization approaches you mentioned:

  1. The data I load is used for matrix multiplication, so I am not aware of any compression algorithms that would be applicable in this case.

  2. My goal is specifically to develop the highest-performance implementation for architectures starting from Ampere onward, so I cannot rely solely on moving to newer hardware to improve performance. At this point, I have already used AI to help develop an implementation that achieves approximately 95% Tensor Core utilization. However, I would like to rethink the algorithm and its implementation in order to push Tensor Core utilization to 99%, thereby further improving performance. The algorithm consists entirely of matrix operations, and the overall performance of the program is directly determined by Tensor Core utilization.

  3. I have already built a three-level data caching hierarchy using registers, shared memory, and L2 cache to store data. Based on my analysis, the L2 cache is under the greatest bandwidth pressure. Therefore, what I need to understand now is: what is the maximum achievable L2 bandwidth, and what access pattern is required to reach that maximum bandwidth?

So one important question is, why is pattern B faster with 1.62 TB/s?

Is some resource occupied too much in A?
I would guess that the stride in A is specifically chosen to use the same memory partition repeatedly.
Also for A all SMs access the same memory addresses, in B the address is dependent on the blockIdx.

So probably it is just an ā€˜unrealistic bad benchmark’ and in practice you get the B numbers.

Nsight Compute (please use it) may have the available bandwidth stored. There also may be benchmarks out there.

There is no public fixed speed rule to compute L2 max bandwidth, but for GPUs of the same architecture it should scale like global memory speed.

Typically it is 2x to 4x global memory speed, depending on the GPU.

The number of bits of the memory interface (every 32 for GDDR, every 512 bits for HBM) determine the number of memory partitions, each responsible for part of the L2 cache.

Sorry, I discovered that the A pattern implemented by the AI contained a serious flaw. It caused threads with the same thread index in different blocks to access exactly the same addresses in each iteration. After correcting this issue, the benchmark was also able to achieve approximately 1.6 TB/s.

However, I still have one question: how are global memory addresses mapped to different L2 partitions?

In the future, when I need to cache a relatively small amount of data in L2, I might consider creating multiple copies of that data so that they are distributed more evenly across different L2 partitions, thereby maximizing the achievable L2 bandwidth.

I believe sectors are 32B and cache lines (4 sectors with shared tags) are 128B.

Take the number of L2 cache slices (reported as device__attribute_l2s_count in Nsight Compute; Should be 40 for RTX 3080), multiply it by the achieved core frequency (boost clock for RTX 3080 seems to be 1710MHz) times 32B. Max L2 bandwidth is thus 1.99TiB/s.

This is undocumented and may differ between the various memory/cache configurations even for the same chip. However, it is clear that the scheme generally aims to achieve even distribution of any larger contiguous block of memory across all L2 slices (and, implicitly tied to that, all DRAM modules). 40*32B is only 1.25KiB. If the data you want to cache is much larger than that it will very likely be efficiently distributed. If it’s smaller than that, you might want to keep local (L1/smem/registers) copies of it rather than rely on L2.

Nvidia tries to optimize it in a way that for all practical workloads, the accesses are balanced across the memory partitions.

An additional question is, whether you really have the physical memory address available or a virtual one (with perhaps the lower bits being reused for both).

I will give you some pointers into that topic:

SGDRC: Software-Defined Dynamic Resource Control for Concurrent DNN Inference on NVIDIA GPUs
(the L2 cache mapping is one topic in their paper)

https://arxiv.org/pdf/2407.13996

FGPU [their project Fractional GPU] assumes that the GPU L2
cacheline and DRAM bank hash mapping functions are pure
XOR functions. We attempted to reverse engineer other
GPUs using FGPU’s approach, but all failed because this
assumption does not hold for many NVIDIA GPUs. The key
issue lies in the nature of the XOR function: it is linear, map-
ping a VRAM space of size 2š‘ bytes to 2š‘€ VRAM channels.
However, many GPUs use non-linear VRAM channel hash
mappings, which map a VRAM space of arbitrary size to
an arbitrary number of VRAM channels. This non-linearity
arises because the VRAM size and the number of VRAM
channels are often not powers of 2 (Tab. 1). The number of
VRAM channels can be cross-validated by the number of
GDDR chips on the GPU (Fig.18 in §A.2) and the theoretical
calculation (i.e., VRAM bus width divided by the bus width
per memory unit). Furthermore, FGPU only supports page
coloring based on 4 KiB granularity, the minimum page size
supported by NVIDIA GPU’s Memory Management Unit
(MMU) [ 37]. However, this is inapplicable to newer GPU
architectures (§5.2).

Reverse Engineering GPU L2 cache/DRAM structure
Here you see the mapping for the GTX 1070, GTX 1080 (both Pascal) and V100 (Volta)
And they provide the source code to find out the mapping for other GPUs.

Accel-Sim: An Extensible Simulation Framework for Validated GPU Modeling

That is a GPU simulator discussing that topic. Their solution could be identical in typical performance, but could use a different mapping.

Our GPU cache model supports a throughput-oriented,
banked, and sectored cache design [32], [62]. The cache is
flexible enough to model GPUs from Kepler through Turing.
We also model the CPU-GPU memory copy engine, since
all DRAM accesses go through the L2, including CPU-GPU
memory copies [53]. To reduce unbalanced memory accesses
across L2 memory partitions, which we refer to as partition
camping [3], [33], we add advanced partition indexing that
xors the L2 bank bits with randomly selected bits from the
page row bits using a Galois-based irreducible polynomial
(IPOLY) interleaving mechanism [56]. Partition camping is
a major problem in contemporary GPUs that have 2n memory
partitions, like High Bandwidth Memory (HBM) [24] which
has 8 channels per stack [33], [51]. IPOLY hashing is guaran-
teed to be conflict-free for all 2n strides, which are common in
GPGPU applications, and also shows reasonable, deterministic
performance for other strides [56].

Sure. Probably was too tired. Sorry and thank you.

That is interesting - the size of L2 is proportional to the memory interface width.
E.g. for Ampere it is 512 KiB L2 size per 32 bits memory interface (for GDDR).
The RTX 3080 with its 320 bits has 10 x 512 KiB = 5 MiB of L2 cache.
Then probably device__attribute_l2s_count is always 4x as high?

You say it can do 32B per cycle, that is 4x32B = 4 sectors per memory interface. That makes sense.

With HBM memory interfaces it seems to be 4x as much:

An A100 has 5120 bits memory interface, which is 10x512 bits (512 bits per memory interface with HBM).
Officially the A100 L2 has a max. bandwidth of 5120 bytes per clock, so 512 bytes per clock per memory interface (16 sectors) instead of 128 bytes per clock (4 sectors).

Given that the theoretical peak bandwidth is 2 TB/s, I’m now even more confused about why I’m only able to achieve 1.6 TB/s of bandwidth.

I find your pseudocode a bit hard to follow, but since the GPU has a total of exactly 5 MiB of L2 you probably want to keep your active set a bit smaller than that, since there is bound to be at least some extra memory traffic (kernel parameters, instructions, etc.) over which you have no control. I’d suggest running it again with 2 or 4 MiB.

Ah yes, what I wrote above may be incorrect for 100-class GPUs starting from A100. I.e. A100 reports 80 L2 slices, but indeed the A100 whitepaper says it has an overall L2 bandwidth of 5120B/cycle (compared to V100’s 2048B/cycle), which would be 64B per slice.

This gets complicated by the way L2 is partitioned, though, such that a first read from a cache line belonging to the far partition gets cached in the near partition, requiring space and bandwidth from both partitions.

Trying to get the terms (more) correctly now.

The actual DDR or GDDR memory is connected via memory controllers. They are shown in the Nvidia architeccture diagrams surrounding the GPCs.

Each GDDR memory controller handles 32 bits, each HBM memory controller handles 512 bits. So from the bitwidth of the memory interface and the memory type, one can calculate the number of memory controllers.

It seems (empirically) that the memory controllers also handle the L2 caches.
Each memory controller for any GPU of the same compute architecture has a fixed size of L2 cache.

And each GDDR memory controller has 4 L2 slices with 32 bytes per cycle each.

Each HBM memory controller has 8 L2 slices with 64 bytes per cycle bandwidth each.

I wouldn’t say that ā€œthe memory controllers also handle the L2 cachesā€, as load/store ops interact with L2 first. But there’s a fixed association between memory controllers and L2 cache slices.

Not V100, though, which appears to have a 4096bit HBM2 memory interface and a unified L2 cache with 64 slices and a bandwidth of 32B/cycle per slice. It’s the later GPUs that have a partitioned L2 cache which increase bandwidth to 64B/cycle per slice.

So the partitioning is per memory controller? Meaning each memory controller is directly connected to both L2 partitions (and not half of memory controllers to each partition).

Or, actually, as we are talking about 32 vs. 64B/cycle/slice, each L2 slice is partitioned?

The other numbers are the same for V100: 512 bits per memory controller (8 memory controllers in the architecture image in the Volta architecture Nvidia white paper), 8 L2 slices per memory controller for HBM → 64 L2 slices.

No, half the L2 slices (and half the memory controllers) make up a partition, with a high-bandwidth interconnect between the two. Which is what enables Blackwell Bx00, for example, to be made of two separate dies in one package.

Then why does Volta have 32B/cycle/slice and later 100-class GPUs with partitioned L2-cache 64B/cycle/slice?

Either the change is independent of the partitioning after all,
or the number of L2 slices is per partition?

The newer GPUs also have significantly higher Tensor Core throughput (and higher per-SM, per-cycle float and int throughput) compared to V100. The increased L2 bandwidth from doubling slice width, plus massively increased capacity, as well as higher capacity L1/smem, is needed to keep those units fed.

Partitioning, as I understand it, is separate from that, but in aid of the same goal of providing more bandwidth at reasonable latency, as well as to lay the groundwork for multi-die GPUs (i.e. Blackwell and beyond).

There was a switch from 32B/L2 slice/cycle to 64B from V100 to A100 as can be seen from this Nvidia GTC presentation on page 24: https://developer.download.nvidia.com/video/gputechconf/gtc/2020/presentations/s21730-inside-the-nvidia-ampere-architecture.pdf

The later 100-class GPUs seem to have kept 64B/cycle/slice.

The consumer class GPUs seem to have 4 L2 slices per memory controller and 32B/cycle/slice even for later architectures.