How can we achieve maximum L2 cache throughput on an RTX 3080 GPU?

SparkHu · June 9, 2026, 4:09am

I am optimizing a program in which each SM runs one block. The pseudocode is as follows::

//Located in gmem. Each element is a 512-byte block. T can be tuned to an //appropriate value in order to achieve optimal L2 throughput performance.
B[8][T];
__shared__ A[64KB]
load A from gmem
//warp id in block
warp_id = threadIdx.x / 32
for i in [0..T]:
    data_b <- B[warp_id][i]
    data_a <- A[XX]
    calculate data op data_a

As shown in the pseudocode, each block reads the same B array data, while different warps within the same block read different portions of the B array. Each block reads the B array only once. Since the B array is relatively small, it can be fully cached in L2. The A data, on the other hand, is entirely cached in shared memory. Given this memory access pattern, how should the B array be arranged in memory in order to achieve maximum L2 throughput? I estimate that the actual bandwidth achievable from the L2 cache will determine the upper bound of the overall program performance.

The reason I believe that the data layout of B determines the achievable upper bound of L2 throughput is that I understand that on the RTX 3080 GPU I am currently using, there are a total of 10 L2 partitions, and each cache line is stored in one of these partitions. Each partition also contains internal sub-partitions. To achieve optimal performance, data accesses must be evenly distributed across the sub-partitions of all L2 partitions. However, I do not know how addresses are mapped to these sub-partitions.

All of the above understanding is based on responses generated by AI; please point out any inaccuracies if present.

GPU: RTX 3080

Additionally, I have another question: is there any difference in the achievable L2 bandwidth when using the cp.async.cg instruction to copy data from L2 to shared memory compared to directly using ld instructions to access L2(Assume no cache hit on L1)?

Curefab · June 9, 2026, 1:11pm

Robert and Greg know the details about the internal architecture in more detail than me.

You can use Nsight Compute to measure the current cache hit rates and achieved throughput compared to the maximum possible.

If you want to get maximum L2 throughput, you should read sectors of 32 bytes or probably slightly better (as they share tags) full cache lines of 128 bytes at the same time. The L1 cache will compensate to a certain degree for non-optimal accesses.
[corrected the sizes according to Nanodeoclus’ mesasge]

If each element is a 512 bytes block, then this line
data_b <- B[warp_id][i]
for a single i reads 512 bytes?

How is it distributed on the threads of the warp? Each threads reads 128 bits (e.g. uint4 or float4)?

I would guess that your program will exhibit nearly optimal performance from the start and that there will be little you can do to improve L2 performance more (talking about small single-digit percentages perhaps).

As long as your LD/ST (Load Store) units in the SM are not fully occupied.
cp.async.cg probably won’t give an improvement.

Different approaches to improve your program could be

to lower the number of or size of necessary L2 accesses, e.g. by compression, or a different algorithm like fusing several iteration steps of the algorithm
to use a Blackwell class device for distributed shared memory
to try to keep your data in the combined L1s or shared memory of the SMs - 68 * 128 KiB L1/SM = 8.5 MiB; however, L1 and shared memory gets erased between kernel calls

SparkHu · June 10, 2026, 3:22am

Thank you for your reply.

B[warp_id][i] actually corresponds to the A-matrix tile used by the mma.sync.aligned.m16n8k32.row.col.s32.s8.s8.s32 instruction on the RTX 3080 Tensor Cores. At the moment, I store it in global memory using the layout required by the MMA instruction for the A matrix.

Each time, I use the following instruction to copy the data from global memory to shared memory:

cp.async.cg.shared.global [%0], [%1], 16

In order to measure the upper bound of the achievable L2 bandwidth, I asked an AI assistant (Claude Code + Opus 4.8) to write a microbenchmark. The measured result was only around 1.1 TB/s.

I am not sure whether this is because I am not using cp.async correctly, because the L2 partitions are not being utilized evenly, or because both factors are contributing to the result.

Could you please explain what you mean by the above statement?

My current understanding is that both ld instructions and cp.async instructions are dispatched to the MIO instruction queue and wait there to be processed. Based on what you said, is it correct to understand that ld instructions are handled by the conventional LD/ST units, while cp.async is processed by a different hardware unit?

I asked an AI assistant to benchmark the L2 bandwidth under two different memory access patterns. The pseudocode is as follows:

constants:

GROUP, NG                 # commit every GROUP copies; keep NG groups in flight

SLOTS = GROUP \* NG        # 16B copies issued per thread per iteration

REGION = 5 MB             # global region == L2 size, L2-resident after warmup

each thread:

smem_addr\[SLOTS\]          # per-warp CONTIGUOUS shared staging (same in both)

gmem_addr\[SLOTS\]          # <-- THE ONLY DIFFERENCE  (sections 1 vs 2)

repeat iters times:

    for grp in 0..NG-1:

        for c in 0..GROUP-1:

            cp.async.cg  smem_addr\[grp\*GROUP + c\] <- gmem_addr\[grp\*GROUP + c\]   # 16B, bypass L1 -> L2

        cp.async.commit_group

        cp.async.wait_group(NG-1)         # keep (NG-1)\*GROUP copies outstanding

    \# addresses are re-read every iteration (volatile asm => not optimized away)

cp.async.wait_group(0)

A warp’s 32 lanes always read `lane*16` => one **contiguous, 512B-aligned**

block per (warp, slot). Coalescing and alignment are therefore the same in both

kernels. Only the *spatial layout of the SLOTS blocks* differs.
bandwidth = blocks * threads * iters * SLOTS * 16 / elapsed

1. Pattern (A) — scattered → ~1.1 TB/s

gmem_addr(warp w, lane L, slot s) =

        g + ( w*512 + L*16 + s*(total*16) )  mod  REGION

                              ~~~~~~~~~~~~

                              huge per-slot stride:  total*16 = gridDim*blockDim*16 ~= 557 KB


per warp : SLOTS blocks, each ~557 KB apart  -> fanned out across the whole 5 MB

Global 5 MB, drawn as 512B blocks:

 0        ~0.5M     ~1.1M     ~1.7M             ~4.5M

 |---------|---------|---------|---- ... -------|

 warp 0 : [s0]      [s1]      [s2]    ...       [s7]

 warp 1 :  [s0]      [s1]      [s2]    ...        [s7]      (+512B)

 warp 2 :   [s0]      [s1] ...

           \____________ at slot 0 every warp is clustered here ____________/

A warp’s 8 accesses fan out over ~4.5 MB; at a given slot index all warps sit

in one narrow window that marches across the region slot by slot.

## 2. Pattern (B) — contiguous (GEMM A-column / B-row style) → ~1.62 TB/s

bpw = floor(REGION / num_warps / 512)          # contiguous 512B blocks per warp (e.g. 9)

gmem_addr(global warp W, lane L, slot s) =

        g + W*(bpw*512) + L*16 + (s mod bpw)*512

            ~~~~~~~~~~~

            each warp owns a PRIVATE, CONTIGUOUS, NON-OVERLAPPING region

per warp : bpw contiguous 512B blocks (~4.6 KB); slots cycle inside that region

Global 5 MB, partitioned into per-warp contiguous regions (~4.6 KB each):

 0       4.6K     9.2K    13.8K                      5M

 |--------|--------|--------|----- ... ------|--------|

 warp 0 : [b0..b8]

 warp 1 :          [b0..b8]

 warp 2 :                   [b0..b8]

   ...                                                [b0..b8]  warp (N-1)

Each warp stays inside its own tight region; the warps tile the full 5 MB.

At any instant the warps are spread across all of L2 => every slice stays busy.

The complete benchmarking code generated by the AI:

l2_read_pattern_compare.txt (11.8 KB)

One more question:

How can the theoretical peak bandwidth of the L2 cache be estimated? For example, can we derive it by knowing the interface width and operating frequency of each L2 partition on the SM-facing side and then calculating the aggregate bandwidth ourselves?

SparkHu · June 10, 2026, 3:30am

For the optimization approaches you mentioned:

The data I load is used for matrix multiplication, so I am not aware of any compression algorithms that would be applicable in this case.
My goal is specifically to develop the highest-performance implementation for architectures starting from Ampere onward, so I cannot rely solely on moving to newer hardware to improve performance. At this point, I have already used AI to help develop an implementation that achieves approximately 95% Tensor Core utilization. However, I would like to rethink the algorithm and its implementation in order to push Tensor Core utilization to 99%, thereby further improving performance. The algorithm consists entirely of matrix operations, and the overall performance of the program is directly determined by Tensor Core utilization.
I have already built a three-level data caching hierarchy using registers, shared memory, and L2 cache to store data. Based on my analysis, the L2 cache is under the greatest bandwidth pressure. Therefore, what I need to understand now is: what is the maximum achievable L2 bandwidth, and what access pattern is required to reach that maximum bandwidth?

Curefab · June 10, 2026, 3:53am

So one important question is, why is pattern B faster with 1.62 TB/s?

Is some resource occupied too much in A?
I would guess that the stride in A is specifically chosen to use the same memory partition repeatedly.
Also for A all SMs access the same memory addresses, in B the address is dependent on the blockIdx.

So probably it is just an ‘unrealistic bad benchmark’ and in practice you get the B numbers.

Nsight Compute (please use it) may have the available bandwidth stored. There also may be benchmarks out there.

There is no public fixed speed rule to compute L2 max bandwidth, but for GPUs of the same architecture it should scale like global memory speed.

Typically it is 2x to 4x global memory speed, depending on the GPU.

The number of bits of the memory interface (every 32 for GDDR, every 512 bits for HBM) determine the number of memory partitions, each responsible for part of the L2 cache.

SparkHu · June 10, 2026, 7:25am

Sorry, I discovered that the A pattern implemented by the AI contained a serious flaw. It caused threads with the same thread index in different blocks to access exactly the same addresses in each iteration. After correcting this issue, the benchmark was also able to achieve approximately 1.6 TB/s.

SparkHu · June 10, 2026, 7:33am

However, I still have one question: how are global memory addresses mapped to different L2 partitions?

In the future, when I need to cache a relatively small amount of data in L2, I might consider creating multiple copies of that data so that they are distributed more evenly across different L2 partitions, thereby maximizing the achievable L2 bandwidth.

Nanodeoclus · June 10, 2026, 10:28am

I believe sectors are 32B and cache lines (4 sectors with shared tags) are 128B.

Take the number of L2 cache slices (reported as device__attribute_l2s_count in Nsight Compute; Should be 40 for RTX 3080), multiply it by the achieved core frequency (boost clock for RTX 3080 seems to be 1710MHz) times 32B. Max L2 bandwidth is thus 1.99TiB/s.

This is undocumented and may differ between the various memory/cache configurations even for the same chip. However, it is clear that the scheme generally aims to achieve even distribution of any larger contiguous block of memory across all L2 slices (and, implicitly tied to that, all DRAM modules). 40*32B is only 1.25KiB. If the data you want to cache is much larger than that it will very likely be efficiently distributed. If it’s smaller than that, you might want to keep local (L1/smem/registers) copies of it rather than rely on L2.

Curefab · June 10, 2026, 10:30am

Nvidia tries to optimize it in a way that for all practical workloads, the accesses are balanced across the memory partitions.

An additional question is, whether you really have the physical memory address available or a virtual one (with perhaps the lower bits being reused for both).

I will give you some pointers into that topic:

SGDRC: Software-Defined Dynamic Resource Control for Concurrent DNN Inference on NVIDIA GPUs
(the L2 cache mapping is one topic in their paper)

https://arxiv.org/pdf/2407.13996

FGPU [their project Fractional GPU] assumes that the GPU L2
cacheline and DRAM bank hash mapping functions are pure
XOR functions. We attempted to reverse engineer other
GPUs using FGPU’s approach, but all failed because this
assumption does not hold for many NVIDIA GPUs. The key
issue lies in the nature of the XOR function: it is linear, map-
ping a VRAM space of size 2𝑁 bytes to 2𝑀 VRAM channels.
However, many GPUs use non-linear VRAM channel hash
mappings, which map a VRAM space of arbitrary size to
an arbitrary number of VRAM channels. This non-linearity
arises because the VRAM size and the number of VRAM
channels are often not powers of 2 (Tab. 1). The number of
VRAM channels can be cross-validated by the number of
GDDR chips on the GPU (Fig.18 in §A.2) and the theoretical
calculation (i.e., VRAM bus width divided by the bus width
per memory unit). Furthermore, FGPU only supports page
coloring based on 4 KiB granularity, the minimum page size
supported by NVIDIA GPU’s Memory Management Unit
(MMU) [ 37]. However, this is inapplicable to newer GPU
architectures (§5.2).

Reverse Engineering GPU L2 cache/DRAM structure
Here you see the mapping for the GTX 1070, GTX 1080 (both Pascal) and V100 (Volta)
And they provide the source code to find out the mapping for other GPUs.

github.com/sakjain92/Fractional-GPUs

doc/REVERSE.md

master

# Reverse Engineering GPU L2 cache/DRAM structure

## Running reverse engineering code

We suspect that the memory hierarchy is same for all the GPUs of the same
architecture family. We base this suspicion on the fact that both GTX 1070 and GTX 1080
have the exact same memory hierarchy (based on our reverse engineering) and both belong to
Pascal architecture. We have already reversed engineering Pascal and Volta architectures,
details of which are provided below.

To reverse engineer details about other architectures, it is required to run reverse engineering code.
For this, FGPU code needs to be compiled in reverse engineering mode (See *[doc/BUILD.md](../doc/BUILD.md)*).

When reverse engineering mode is enable and FGPU is compiled, reverse engineering 
code is builded in form of a binary *gpu_reverse_engineering*, present in the build
directory. Running it provides with the details of L2 cache and DRAM structure of the current GPU. 
Refer to *[doc/PORT.md](../doc/PORT.md)* on how to run FGPU applications.

Also, some support needs to be added for that specific architecture in the device driver:
*$PROJ_DIR/driver/NVIDIA-Linux-x86_64-390.48/kernel/nvidia-uvm/uvm8_\<arch\>.c*

This file has been truncated. show original

Accel-Sim: An Extensible Simulation Framework for Validated GPU Modeling

That is a GPU simulator discussing that topic. Their solution could be identical in typical performance, but could use a different mapping.

Our GPU cache model supports a throughput-oriented,
banked, and sectored cache design [32], [62]. The cache is
flexible enough to model GPUs from Kepler through Turing.
We also model the CPU-GPU memory copy engine, since
all DRAM accesses go through the L2, including CPU-GPU
memory copies [53]. To reduce unbalanced memory accesses
across L2 memory partitions, which we refer to as partition
camping [3], [33], we add advanced partition indexing that
xors the L2 bank bits with randomly selected bits from the
page row bits using a Galois-based irreducible polynomial
(IPOLY) interleaving mechanism [56]. Partition camping is
a major problem in contemporary GPUs that have 2n memory
partitions, like High Bandwidth Memory (HBM) [24] which
has 8 channels per stack [33], [51]. IPOLY hashing is guaran-
teed to be conflict-free for all 2n strides, which are common in
GPGPU applications, and also shows reasonable, deterministic
performance for other strides [56].

Curefab · June 10, 2026, 10:38am

Sure. Probably was too tired. Sorry and thank you.

That is interesting - the size of L2 is proportional to the memory interface width.
E.g. for Ampere it is 512 KiB L2 size per 32 bits memory interface (for GDDR).
The RTX 3080 with its 320 bits has 10 x 512 KiB = 5 MiB of L2 cache.
Then probably device__attribute_l2s_count is always 4x as high?

You say it can do 32B per cycle, that is 4x32B = 4 sectors per memory interface. That makes sense.

With HBM memory interfaces it seems to be 4x as much:

An A100 has 5120 bits memory interface, which is 10x512 bits (512 bits per memory interface with HBM).
Officially the A100 L2 has a max. bandwidth of 5120 bytes per clock, so 512 bytes per clock per memory interface (16 sectors) instead of 128 bytes per clock (4 sectors).

SparkHu · June 10, 2026, 1:09pm

Given that the theoretical peak bandwidth is 2 TB/s, I’m now even more confused about why I’m only able to achieve 1.6 TB/s of bandwidth.

Nanodeoclus · June 10, 2026, 1:24pm

I find your pseudocode a bit hard to follow, but since the GPU has a total of exactly 5 MiB of L2 you probably want to keep your active set a bit smaller than that, since there is bound to be at least some extra memory traffic (kernel parameters, instructions, etc.) over which you have no control. I’d suggest running it again with 2 or 4 MiB.

Nanodeoclus · June 10, 2026, 3:17pm

Ah yes, what I wrote above may be incorrect for 100-class GPUs starting from A100. I.e. A100 reports 80 L2 slices, but indeed the A100 whitepaper says it has an overall L2 bandwidth of 5120B/cycle (compared to V100’s 2048B/cycle), which would be 64B per slice.

This gets complicated by the way L2 is partitioned, though, such that a first read from a cache line belonging to the far partition gets cached in the near partition, requiring space and bandwidth from both partitions.

Curefab · June 10, 2026, 5:52pm

Trying to get the terms (more) correctly now.

The actual DDR or GDDR memory is connected via memory controllers. They are shown in the Nvidia architeccture diagrams surrounding the GPCs.

Each GDDR memory controller handles 32 bits, each HBM memory controller handles 512 bits. So from the bitwidth of the memory interface and the memory type, one can calculate the number of memory controllers.

It seems (empirically) that the memory controllers also handle the L2 caches.
Each memory controller for any GPU of the same compute architecture has a fixed size of L2 cache.

And each GDDR memory controller has 4 L2 slices with 32 bytes per cycle each.

Each HBM memory controller has 8 L2 slices with 64 bytes per cycle bandwidth each.

Nanodeoclus · June 10, 2026, 7:46pm

I wouldn’t say that “the memory controllers also handle the L2 caches”, as load/store ops interact with L2 first. But there’s a fixed association between memory controllers and L2 cache slices.

Not V100, though, which appears to have a 4096bit HBM2 memory interface and a unified L2 cache with 64 slices and a bandwidth of 32B/cycle per slice. It’s the later GPUs that have a partitioned L2 cache which increase bandwidth to 64B/cycle per slice.

Curefab · June 10, 2026, 8:38pm

So the partitioning is per memory controller? Meaning each memory controller is directly connected to both L2 partitions (and not half of memory controllers to each partition).

Or, actually, as we are talking about 32 vs. 64B/cycle/slice, each L2 slice is partitioned?

The other numbers are the same for V100: 512 bits per memory controller (8 memory controllers in the architecture image in the Volta architecture Nvidia white paper), 8 L2 slices per memory controller for HBM → 64 L2 slices.

Nanodeoclus · June 11, 2026, 6:03pm

No, half the L2 slices (and half the memory controllers) make up a partition, with a high-bandwidth interconnect between the two. Which is what enables Blackwell Bx00, for example, to be made of two separate dies in one package.

Curefab · June 11, 2026, 6:27pm

Then why does Volta have 32B/cycle/slice and later 100-class GPUs with partitioned L2-cache 64B/cycle/slice?

Either the change is independent of the partitioning after all,
or the number of L2 slices is per partition?

Nanodeoclus · June 13, 2026, 1:00pm

The newer GPUs also have significantly higher Tensor Core throughput (and higher per-SM, per-cycle float and int throughput) compared to V100. The increased L2 bandwidth from doubling slice width, plus massively increased capacity, as well as higher capacity L1/smem, is needed to keep those units fed.

Partitioning, as I understand it, is separate from that, but in aid of the same goal of providing more bandwidth at reasonable latency, as well as to lay the groundwork for multi-die GPUs (i.e. Blackwell and beyond).

Curefab · June 13, 2026, 1:43pm

There was a switch from 32B/L2 slice/cycle to 64B from V100 to A100 as can be seen from this Nvidia GTC presentation on page 24: https://developer.download.nvidia.com/video/gputechconf/gtc/2020/presentations/s21730-inside-the-nvidia-ampere-architecture.pdf

The later 100-class GPUs seem to have kept 64B/cycle/slice.

The consumer class GPUs seem to have 4 L2 slices per memory controller and 32B/cycle/slice even for later architectures.

Topic		Replies	Views
Tesla K40 L2 bandwidth CUDA Programming and Performance	12	4246	December 23, 2015
L2 Bandwidth Value for A100 Calculation CUDA Programming and Performance	5	289	January 28, 2025
How to correctly write code to test A100 L2 bandwidth？ CUDA Programming and Performance	6	2653	October 17, 2023
Cuda Memory Bank layout Interleaving, Addressing, Conflicts CUDA Programming and Performance	25	61811	September 4, 2008
Squeasing max d2d memory bandwidth (GTX 480) CUDA Programming and Performance	15	7193	November 2, 2010
Technical questions on GTX1080ti multiplication CUDA Programming and Performance	14	2204	November 11, 2017
Even more Fermi Fun: Uncoalesced writes CUDA Programming and Performance	8	9006	June 5, 2010
Why the performance of tf32 tensor_core is poor? CUDA Programming and Performance	20	2203	August 8, 2023
Global memory access bottleneck CUDA Programming and Performance	8	3661	September 4, 2015
How to get the most dot products of batched vectors out of L4 GPU CUDA Programming and Performance	13	468	May 28, 2025

How can we achieve maximum L2 cache throughput on an RTX 3080 GPU?

Related topics