SM100 TMEM: rationale for per-warp access restriction (tcgen05.ld/st)?

Hi,

I’m writing a custom attention decode kernel for Blackwell (SM100a) and have run into a limitation with Tensor Memory access restrictions that I’d like to understand better.

Background

PTX ISA 9.7.16.8.1 states:

“The Tensor Memory of a CTA is divided into 4 equal chunks such that each warp of a warpgroup in the CTA can access a chunk of the Tensor Memory. A lane of the Tensor Memory can be accessed by a single warp in the warpgroup.”

Warp    Accessible dp lanes
0       0-31
1       32-63
2       64-95
3       96-127
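The restriction amounts to a fixed, non-overlapping warp-to-lane partition. A minimal model of that mapping (illustrative only, not an API):

```python
def accessible_dp_lanes(warp_id: int) -> range:
    """dp lanes that warp `warp_id` of the warpgroup may touch via
    tcgen05.ld/st, per the PTX quote above: warp w owns [32*w, 32*w + 31]."""
    assert 0 <= warp_id < 4
    return range(32 * warp_id, 32 * (warp_id + 1))

# The four chunks tile dp 0-127 with no overlap, so no warp can ever
# load from or store into another warp's chunk.
owned = [set(accessible_dp_lanes(w)) for w in range(4)]
assert set().union(*owned) == set(range(128))
assert all(owned[i].isdisjoint(owned[j]) for i in range(4) for j in range(i + 1, 4))
```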

The problem

The shapes are:

QK GEMM:  P[64, 128] = Q[64, 576] × K^T[576, 128]     (fp8 × fp8 → fp32)
Softmax:  S[64, 128] = softmax(P)                       (fp32 → fp8)
SV GEMM:  O[64, 512] = S[64, 128] × V[128, 512]        (fp8 × fp8 → fp32)
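For reference, the three stages chain together dimensionally as follows (a plain shape check, nothing Blackwell-specific):

```python
def matmul_shape(a, b):
    """(m, k) x (k, n) -> (m, n); raises if the inner dims disagree."""
    (m, k1), (k2, n) = a, b
    assert k1 == k2, "inner dimensions must match"
    return (m, n)

P = matmul_shape((64, 576), (576, 128))  # QK GEMM -> (64, 128)
S = P                                    # softmax is elementwise per row
O = matmul_shape(S, (128, 512))          # SV GEMM -> (64, 512)
assert P == (64, 128) and O == (64, 512)
```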

Using cta_group::1 (single SM), M=64 throughout. After the QK GEMM, a softmax warpgroup reads P from TMEM, computes S = softmax(P) in registers, and then S needs to feed into the SV GEMM as the A operand.

Current approach (SS MMA for SV): softmax writes S from registers → SMEM (via STS), then the SV GEMM reads S from SMEM. This works, but the STS path has ~3.5M shared memory bank conflicts per kernel launch. The SW128 swizzled layout required by the MMA descriptor and the softmax thread-to-column mapping create unavoidable multi-way conflicts.
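The conflict count can be reasoned about with the standard 32-bank model (bank = (byte_address // 4) % 32). A toy conflict counter, with a made-up column-strided store pattern standing in for the real SW128 + softmax thread-to-column mapping:

```python
def max_conflict_ways(byte_addrs):
    """Worst-case bank-conflict degree for one warp's 4-byte accesses:
    the number of distinct 32-bit words mapping to the busiest bank.
    (Identical addresses broadcast, so they count as one word.)"""
    banks = {}
    for a in byte_addrs:
        banks.setdefault((a // 4) % 32, set()).add(a // 4)
    return max(len(words) for words in banks.values())

# Conflict-free: 32 threads hit 32 consecutive words, one word per bank.
assert max_conflict_ways([4 * t for t in range(32)]) == 1
# Column-strided: stride 128 B = 32 words puts every thread in bank 0.
assert max_conflict_ways([128 * t for t in range(32)]) == 32
```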

What I want (TS MMA for SV): bypass SMEM entirely for S — softmax writes S from registers → TMEM (via tcgen05.st), then the SV GEMM reads S directly from TMEM as the A operand (tcgen05.mma.ws.cta_group::1.kind::f8f6f4 [C], [A], B_desc, ... — note [A] = TMEM address). Zero bank conflicts since SMEM is never touched.

The blocker: the .ws SV GEMM with M=64 uses Layout E (2x3 datapath, tmem_frg_ws_1sm). All 128 dp are used, with each output row R split across two dp groups:

  • dp R (warp 0 or 1, dp 0-63) holds one half of O[row R]'s N columns
  • dp R+64 (warp 2 or 3, dp 64-127) holds the other half of O[row R]'s N columns

The MMA operates dp-by-dp (O[dp] += S[dp] × V), so S must contain the same row at both dp R and dp R+64 for both halves to be correct.
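Concretely: because the MMA multiplies each dp lane's A row into that same lane's accumulator row, the split output layout forces the duplication. A scalar-level model of the constraint (my reading of Layout E, not an official statement):

```python
# Model: output row R lives at dp R (first half of N) and dp R+64 (second
# half).  The MMA computes O[dp] += S[dp] @ V_slice dp-by-dp, so the S row
# fed to dp R and dp R+64 must be the SAME row for the halves to match.
def halves_consistent(S_tmem, row):
    return S_tmem[row] == S_tmem[row + 64]

S_tmem = {dp: None for dp in range(128)}
for r in range(64):  # softmax must duplicate each row across both dp groups
    S_tmem[r] = S_tmem[r + 64] = ("S_row", r)
assert all(halves_consistent(S_tmem, r) for r in range(64))
```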

Using tcgen05.ld.sync.aligned.16x32bx2, I can read the non-ws P accumulator (Layout F) efficiently — lanes 0-15 read the first half of each row and lanes 16-31 read the second half from the same dp, so all 32 lanes per warp are productive and each thread only needs 64 registers. After softmax, I can write S back via tcgen05.st.sync.aligned.16x32bx2 in the same pattern. But that means S only lands in the warp’s own dp range (e.g., warp 0 writes to dp 0-15, warp 2 writes to dp 64-79), and each dp holds a different row. The SV GEMM needs dp R and dp R+64 to hold the same row, but the per-warp access restriction means warp 0 cannot duplicate its S into warp 2’s dp range.
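My mental model of that 16x32bx2 pattern, which also shows why S lands only in the issuing warp's dp range (an assumption inferred from the behavior described above, not a spec statement): thread t of warp w touches dp = 32*w + (t mod 16), with threads 0-15 covering the first half-row and threads 16-31 the second half at the split offset.

```python
def tcgen05_16x32bx2_dp(warp_id: int, lane: int) -> tuple:
    """(dp lane, row half) touched by one thread, in this model."""
    dp = 32 * warp_id + (lane % 16)
    half = "first" if lane < 16 else "second"
    return dp, half

# Warp 0 covers only dp 0-15 and warp 2 only dp 64-79, matching the
# observation that each warp's S lands solely in its own dp chunk.
assert {tcgen05_16x32bx2_dp(0, l)[0] for l in range(32)} == set(range(16))
assert {tcgen05_16x32bx2_dp(2, l)[0] for l in range(32)} == set(range(64, 80))
```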

The workarounds I’ve found are:

  1. SMEM exchange: read S from TMEM into registers, store to SMEM, sync, read it back from SMEM in the other warp, then tcgen05.st it into that warp’s own dp range. This adds overhead (two syncs plus an SMEM round-trip) that negates the benefit of bypassing SMEM in the first place.

  2. Non-ws TS MMA (tcgen05.mma without .ws, Layout F, 4x1 datapath): this keeps the O accumulator single-dp-group (tmem_frg_1sm) — each active dp holds the full N columns, so no duplication of S is needed. But with D_V=512, a single-dp-group O needs 512 TMEM columns, which exceeds what’s available alongside S and P. So I’d have to N-tile the SV GEMM (e.g., 4 tiles of N=128), adding complexity around where to accumulate partial O tiles across KV blocks.
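The column-budget arithmetic behind workaround 2, assuming the documented 512 TMEM columns per CTA (one 32-bit column per dp lane; the S and P column counts are my estimates for fp32/fp8 packing):

```python
TMEM_COLUMNS = 512              # per CTA; each column is 32 bits per dp lane

def cols(n_elems: int, bytes_per_elem: int) -> int:
    """TMEM columns needed per dp lane for n_elems packed values."""
    return (n_elems * bytes_per_elem + 3) // 4

O_full = cols(512, 4)           # fp32 O, single dp group, full N=512
P_cols = cols(128, 4)           # fp32 P accumulator
S_cols = cols(128, 1)           # fp8 S, 4 values per column
assert O_full == TMEM_COLUMNS                     # O alone uses the budget
assert O_full + P_cols + S_cols > TMEM_COLUMNS    # no room next to S and P

O_tile = cols(128, 4)           # N-tiled into 4 tiles of N=128
assert O_tile + P_cols + S_cols <= TMEM_COLUMNS   # tiled version fits
```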

Questions

  1. What is the architectural rationale for the per-warp TMEM access restriction? Is it a physical wiring constraint (each warp’s register file connects to a specific TMEM bank), or a design choice for simplicity/correctness?

  2. Is there any mechanism I’m missing that would allow one warp to write data that another warp can read from its own dp range — without going through SMEM? For example:

    • A broadcast/multicast TMEM store?
    • A TMEM-to-TMEM copy between dp ranges?
    • A different tcgen05.st shape/qualifier that relaxes the restriction?

  3. For the .ws Layout E use case specifically (where both dp groups 0-63 and 64-127 need the same data): is the intended pattern to always go through SMEM to populate both dp groups, or is there a recommended way to duplicate data across warp-owned dp ranges?

Any insight into the design intent would help me choose the right approach. Thanks!


Environment: B200 (SM100a), CUDA 13.0, PTX ISA 9.1

Each warp within a warpgroup is assigned to a different SM sub-partition, and each sub-partition has its own Tensor Memory of 64 KB, i.e. 32 lanes of 2 KB each. See, for example, Figure 2 in the NVIDIA blog post Inside NVIDIA Blackwell Ultra: The Chip Powering the AI Factory Era | NVIDIA Technical Blog.

That is the reason behind the access restriction: each warp can only reach the TMEM of its own sub-partition.
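Those figures are consistent with the TMEM geometry quoted earlier in the thread (128 dp lanes, 512 32-bit columns per lane, 4 sub-partitions); a quick arithmetic check:

```python
LANES_PER_SUBPARTITION = 32
BYTES_PER_LANE = 2 * 1024            # "32 lanes of 2 KB"
COLUMN_BYTES = 4                     # one 32-bit TMEM column per lane

sub_tmem = LANES_PER_SUBPARTITION * BYTES_PER_LANE
assert sub_tmem == 64 * 1024                    # 64 KB per sub-partition
assert 4 * sub_tmem == 256 * 1024               # 256 KB TMEM per SM / CTA
assert BYTES_PER_LANE // COLUMN_BYTES == 512    # 512 columns per lane
```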