SM100 TMEM: rationale for per-warp access restriction (tcgen05.ld/st)?

Hi,

I’m writing a custom attention decode kernel for Blackwell (SM100a) and have run into a limitation with Tensor Memory access restrictions that I’d like to understand better.

Background

PTX ISA 9.7.16.8.1 states:

“The Tensor Memory of a CTA is divided into 4 equal chunks such that each warp of a warpgroup in the CTA can access a chunk of the Tensor Memory. A lane of the Tensor Memory can be accessed by a single warp in the warpgroup.”

Warp    Accessible dp lanes
0       0-31
1       32-63
2       64-95
3       96-127
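The restriction amounts to a fixed, non-overlapping warp-to-lane partition. A minimal model of that mapping (illustrative only, not an API):

```python
def accessible_dp_lanes(warp_id: int) -> range:
    """dp lanes that warp `warp_id` of the warpgroup may touch via
    tcgen05.ld/st, per the PTX quote above: warp w owns [32*w, 32*w + 31]."""
    assert 0 <= warp_id < 4
    return range(32 * warp_id, 32 * (warp_id + 1))

# The four chunks tile dp 0-127 with no overlap, so no warp can ever
# load from or store into another warp's chunk.
owned = [set(accessible_dp_lanes(w)) for w in range(4)]
assert set().union(*owned) == set(range(128))
assert all(owned[i].isdisjoint(owned[j]) for i in range(4) for j in range(i + 1, 4))
```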

The problem

The shapes are:

QK GEMM:  P[64, 128] = Q[64, 576] × K^T[576, 128]     (fp8 × fp8 → fp32)
Softmax:  S[64, 128] = softmax(P)                       (fp32 → fp8)
SV GEMM:  O[64, 512] = S[64, 128] × V[128, 512]        (fp8 × fp8 → fp32)
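For reference, the three stages chain together dimensionally as follows (a plain shape check, nothing Blackwell-specific):

```python
def matmul_shape(a, b):
    """(m, k) x (k, n) -> (m, n); raises if the inner dims disagree."""
    (m, k1), (k2, n) = a, b
    assert k1 == k2, "inner dimensions must match"
    return (m, n)

P = matmul_shape((64, 576), (576, 128))  # QK GEMM -> (64, 128)
S = P                                    # softmax is elementwise per row
O = matmul_shape(S, (128, 512))          # SV GEMM -> (64, 512)
assert P == (64, 128) and O == (64, 512)
```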

Using cta_group::1 (single SM), M=64 throughout. After the QK GEMM, a softmax warpgroup reads P from TMEM, computes S = softmax(P) in registers, and then S needs to feed into the SV GEMM as the A operand.

Current approach (SS MMA for SV): softmax writes S from registers → SMEM (via STS), then the SV GEMM reads S from SMEM. This works, but the STS path has ~3.5M shared memory bank conflicts per kernel launch. The SW128 swizzled layout required by the MMA descriptor and the softmax thread-to-column mapping create unavoidable multi-way conflicts.
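The conflict count can be reasoned about with the standard 32-bank model (bank = (byte_address // 4) % 32). A toy conflict counter, with a made-up column-strided store pattern standing in for the real SW128 + softmax thread-to-column mapping:

```python
def max_conflict_ways(byte_addrs):
    """Worst-case bank-conflict degree for one warp's 4-byte accesses:
    the number of distinct 32-bit words mapping to the busiest bank.
    (Identical addresses broadcast, so they count as one word.)"""
    banks = {}
    for a in byte_addrs:
        banks.setdefault((a // 4) % 32, set()).add(a // 4)
    return max(len(words) for words in banks.values())

# Conflict-free: 32 threads hit 32 consecutive words, one word per bank.
assert max_conflict_ways([4 * t for t in range(32)]) == 1
# Column-strided: stride 128 B = 32 words puts every thread in bank 0.
assert max_conflict_ways([128 * t for t in range(32)]) == 32
```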

What I want (TS MMA for SV): bypass SMEM entirely for S — softmax writes S from registers → TMEM (via tcgen05.st), then the SV GEMM reads S directly from TMEM as the A operand (tcgen05.mma.ws.cta_group::1.kind::f8f6f4 [C], [A], B_desc, ... — note [A] = TMEM address). Zero bank conflicts since SMEM is never touched.

The blocker: the .ws SV GEMM with M=64 uses Layout E (2x3 datapath, tmem_frg_ws_1sm). All 128 dp are used, with each output row R split across two dp groups:

  • dp R (warp 0 or 1, dp 0-63) holds one half of O[row R]'s N columns
  • dp R+64 (warp 2 or 3, dp 64-127) holds the other half of O[row R]'s N columns

The MMA operates dp-by-dp (O[dp] += S[dp] × V), so S must contain the same row at both dp R and dp R+64 for both halves to be correct.
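Concretely: because the MMA multiplies each dp lane's A row into that same lane's accumulator row, the split output layout forces the duplication. A scalar-level model of the constraint (my reading of Layout E, not an official statement):

```python
# Model: output row R lives at dp R (first half of N) and dp R+64 (second
# half).  The MMA computes O[dp] += S[dp] @ V_slice dp-by-dp, so the S row
# fed to dp R and dp R+64 must be the SAME row for the halves to match.
def halves_consistent(S_tmem, row):
    return S_tmem[row] == S_tmem[row + 64]

S_tmem = {dp: None for dp in range(128)}
for r in range(64):  # softmax must duplicate each row across both dp groups
    S_tmem[r] = S_tmem[r + 64] = ("S_row", r)
assert all(halves_consistent(S_tmem, r) for r in range(64))
```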

Using tcgen05.ld.sync.aligned.16x32bx2, I can read the non-ws P accumulator (Layout F) efficiently — lanes 0-15 read the first half of each row and lanes 16-31 read the second half from the same dp, so all 32 lanes per warp are productive and each thread only needs 64 registers. After softmax, I can write S back via tcgen05.st.sync.aligned.16x32bx2 in the same pattern. But that means S only lands in the warp’s own dp range (e.g., warp 0 writes to dp 0-15, warp 2 writes to dp 64-79), and each dp holds a different row. The SV GEMM needs dp R and dp R+64 to hold the same row, but the per-warp access restriction means warp 0 cannot duplicate its S into warp 2’s dp range.
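My mental model of that 16x32bx2 pattern, which also shows why S lands only in the issuing warp's dp range (an assumption inferred from the behavior described above, not a spec statement): thread t of warp w touches dp = 32*w + (t mod 16), with threads 0-15 covering the first half-row and threads 16-31 the second half at the split offset.

```python
def tcgen05_16x32bx2_dp(warp_id: int, lane: int) -> tuple:
    """(dp lane, row half) touched by one thread, in this model."""
    dp = 32 * warp_id + (lane % 16)
    half = "first" if lane < 16 else "second"
    return dp, half

# Warp 0 covers only dp 0-15 and warp 2 only dp 64-79, matching the
# observation that each warp's S lands solely in its own dp chunk.
assert {tcgen05_16x32bx2_dp(0, l)[0] for l in range(32)} == set(range(16))
assert {tcgen05_16x32bx2_dp(2, l)[0] for l in range(32)} == set(range(64, 80))
```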

The workarounds I’ve found are:

  1. SMEM exchange: read S from TMEM into registers, store to SMEM, sync, read it back from SMEM in the other warp, then tcgen05.st it into that warp’s own dp range. This adds overhead (two syncs plus an SMEM round-trip) that negates the benefit of bypassing SMEM in the first place.

  2. Non-ws TS MMA (tcgen05.mma without .ws, Layout F, 4x1 datapath): this keeps the O accumulator single-dp-group (tmem_frg_1sm) — each active dp holds the full N columns, so no duplication of S is needed. But with D_V=512, a single-dp-group O needs 512 TMEM columns, which exceeds what’s available alongside S and P. So I’d have to N-tile the SV GEMM (e.g., 4 tiles of N=128), adding complexity around where to accumulate partial O tiles across KV blocks.
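The column-budget arithmetic behind workaround 2, assuming the documented 512 TMEM columns per CTA (one 32-bit column per dp lane; the S and P column counts are my estimates for fp32/fp8 packing):

```python
TMEM_COLUMNS = 512              # per CTA; each column is 32 bits per dp lane

def cols(n_elems: int, bytes_per_elem: int) -> int:
    """TMEM columns needed per dp lane for n_elems packed values."""
    return (n_elems * bytes_per_elem + 3) // 4

O_full = cols(512, 4)           # fp32 O, single dp group, full N=512
P_cols = cols(128, 4)           # fp32 P accumulator
S_cols = cols(128, 1)           # fp8 S, 4 values per column
assert O_full == TMEM_COLUMNS                     # O alone uses the budget
assert O_full + P_cols + S_cols > TMEM_COLUMNS    # no room next to S and P

O_tile = cols(128, 4)           # N-tiled into 4 tiles of N=128
assert O_tile + P_cols + S_cols <= TMEM_COLUMNS   # tiled version fits
```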

Questions

  1. What is the architectural rationale for the per-warp TMEM access restriction? Is it a physical wiring constraint (each warp’s register file connects to a specific TMEM bank), or a design choice for simplicity/correctness?

  2. Is there any mechanism I’m missing that would allow one warp to write data that another warp can read from its own dp range — without going through SMEM? For example:

    • A broadcast/multicast TMEM store?
    • A TMEM-to-TMEM copy between dp ranges?
    • A different tcgen05.st shape/qualifier that relaxes the restriction?

  3. For the .ws Layout E use case specifically (where both dp groups 0-63 and 64-127 need the same data): is the intended pattern to always go through SMEM to populate both dp groups, or is there a recommended way to duplicate data across warp-owned dp ranges?

Any insight into the design intent would help me choose the right approach. Thanks!


Environment: B200 (SM100a), CUDA 13.0, PTX ISA 9.1

Each warp within a warpgroup is assigned to a different SM sub-partition, and each sub-partition has its own Tensor Memory of 64 KB, i.e. 32 lanes of 2 KB each. See, for example, Figure 2 in the NVIDIA blog post Inside NVIDIA Blackwell Ultra: The Chip Powering the AI Factory Era | NVIDIA Technical Blog.

That is the reason behind the access restriction: each warp can only reach the TMEM of its own sub-partition.
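Those figures are consistent with the TMEM geometry quoted earlier in the thread (128 dp lanes, 512 32-bit columns per lane, 4 sub-partitions); a quick arithmetic check:

```python
LANES_PER_SUBPARTITION = 32
BYTES_PER_LANE = 2 * 1024            # "32 lanes of 2 KB"
COLUMN_BYTES = 4                     # one 32-bit TMEM column per lane

sub_tmem = LANES_PER_SUBPARTITION * BYTES_PER_LANE
assert sub_tmem == 64 * 1024                    # 64 KB per sub-partition
assert 4 * sub_tmem == 256 * 1024               # 256 KB TMEM per SM / CTA
assert BYTES_PER_LANE // COLUMN_BYTES == 512    # 512 columns per lane
```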