Thank you for your reply.
B[warp_id][i] actually corresponds to the A-matrix tile used by the mma.sync.aligned.m16n8k32.row.col.s32.s8.s8.s32 instruction on the RTX 3080 Tensor Cores. At the moment, I store it in global memory using the layout required by the MMA instruction for the A matrix.
Each time, I use the following instruction to copy the data from global memory to shared memory:
cp.async.cg.shared.global [%0], [%1], 16
In order to measure the upper bound of the achievable L2 bandwidth, I asked an AI assistant (Claude Code + Opus 4.8) to write a microbenchmark. The measured result was only around 1.1 TB/s.
I am not sure whether this is because I am not using cp.async correctly, because the L2 partitions are not being utilized evenly, or because both factors are contributing to the result.
Could you please explain what you mean by the above statement?
My current understanding is that both ld instructions and cp.async instructions are dispatched to the MIO instruction queue and wait there to be processed. Based on what you said, is it correct to understand that ld instructions are handled by the conventional LD/ST units, while cp.async is processed by a different hardware unit?
I asked an AI assistant to benchmark the L2 bandwidth under two different memory access patterns. The pseudocode is as follows:
constants:
GROUP, NG # commit every GROUP copies; keep NG groups in flight
SLOTS = GROUP \* NG # 16B copies issued per thread per iteration
REGION = 5 MB # global region == L2 size, L2-resident after warmup
each thread:
smem_addr\[SLOTS\] # per-warp CONTIGUOUS shared staging (same in both)
gmem_addr\[SLOTS\] # <-- THE ONLY DIFFERENCE (sections 1 vs 2)
repeat iters times:
for grp in 0..NG-1:
for c in 0..GROUP-1:
cp.async.cg smem_addr\[grp\*GROUP + c\] <- gmem_addr\[grp\*GROUP + c\] # 16B, bypass L1 -> L2
cp.async.commit_group
cp.async.wait_group(NG-1) # keep (NG-1)\*GROUP copies outstanding
\# addresses are re-read every iteration (volatile asm => not optimized away)
cp.async.wait_group(0)
A warpās 32 lanes always read `lane*16` => one **contiguous, 512B-aligned**
block per (warp, slot). Coalescing and alignment are therefore the same in both
kernels. Only the *spatial layout of the SLOTS blocks* differs.
bandwidth = blocks * threads * iters * SLOTS * 16 / elapsed
1. Pattern (A) ā scattered ā ~1.1 TB/s
gmem_addr(warp w, lane L, slot s) =
g + ( w*512 + L*16 + s*(total*16) ) mod REGION
~~~~~~~~~~~~
huge per-slot stride: total*16 = gridDim*blockDim*16 ~= 557 KB
per warp : SLOTS blocks, each ~557 KB apart -> fanned out across the whole 5 MB
Global 5 MB, drawn as 512B blocks:
0 ~0.5M ~1.1M ~1.7M ~4.5M
|---------|---------|---------|---- ... -------|
warp 0 : [s0] [s1] [s2] ... [s7]
warp 1 : [s0] [s1] [s2] ... [s7] (+512B)
warp 2 : [s0] [s1] ...
\____________ at slot 0 every warp is clustered here ____________/
A warpās 8 accesses fan out over ~4.5 MB; at a given slot index all warps sit
in one narrow window that marches across the region slot by slot.
## 2. Pattern (B) ā contiguous (GEMM A-column / B-row style) ā ~1.62 TB/s
bpw = floor(REGION / num_warps / 512) # contiguous 512B blocks per warp (e.g. 9)
gmem_addr(global warp W, lane L, slot s) =
g + W*(bpw*512) + L*16 + (s mod bpw)*512
~~~~~~~~~~~
each warp owns a PRIVATE, CONTIGUOUS, NON-OVERLAPPING region
per warp : bpw contiguous 512B blocks (~4.6 KB); slots cycle inside that region
Global 5 MB, partitioned into per-warp contiguous regions (~4.6 KB each):
0 4.6K 9.2K 13.8K 5M
|--------|--------|--------|----- ... ------|--------|
warp 0 : [b0..b8]
warp 1 : [b0..b8]
warp 2 : [b0..b8]
... [b0..b8] warp (N-1)
Each warp stays inside its own tight region; the warps tile the full 5 MB.
At any instant the warps are spread across all of L2 => every slice stays busy.
The complete benchmarking code generated by the AI:
l2_read_pattern_compare.txt (11.8 KB)
One more question:
How can the theoretical peak bandwidth of the L2 cache be estimated? For example, can we derive it by knowing the interface width and operating frequency of each L2 partition on the SM-facing side and then calculating the aggregate bandwidth ourselves?