Does SMEM swizzle mode affect tcgen05.mma throughput? (SM100, fp8 SS MMA)

I’m trying to understand the purpose of SMEM swizzle patterns (SWIZZLE_NONE vs SWIZZLE_32B vs SWIZZLE_64B vs SWIZZLE_128B) for tcgen05.mma on Blackwell (SM100/B200).

I wrote a minimal benchmark that runs an fp8 SS MMA (M=64, N=64, K=128, kind::f8f6f4, cta_group::1) with four different SMEM descriptor layout_type settings: SWIZZLE_NONE, SWIZZLE_32B, SWIZZLE_64B, and SWIZZLE_128B. Each variant stores data in the correct physical format for its layout type, constructs valid descriptors, and produces correct numerical results (verified against CPU reference).

However, when I measure MMA-only throughput using device-side clock64() (excluding SMEM fill and TMEM readback), all four variants show identical performance:

| Swizzle mode | MMA cycles (100 iters) | Cycles/iter |
| --- | --- | --- |
| SWIZZLE_NONE | 12,960 | 129.6 |
| SWIZZLE_32B | 12,923 | 129.2 |
| SWIZZLE_64B | 12,923 | 129.2 |
| SWIZZLE_128B | 12,923 | 129.2 |

NCU profiling also shows 0 in l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld.sum for all four variants.

Questions:

  1. Does swizzle mode actually affect tcgen05.mma SMEM read throughput? The documentation says swizzle avoids SMEM bank conflicts, but I can’t observe any performance difference. Is it possible the tensor core’s SMEM read path doesn’t experience bank conflicts (or handles them internally)?
  2. Is the effect only visible at scale? My test uses a single CTA with one MMA stream. Would bank conflicts from swizzle mismatches only become visible in a production kernel where MMA competes with TMA/UTCCP for SMEM bandwidth?
  3. Are there NCU metrics that capture the tensor core’s SMEM read path? l1tex__data_bank_conflicts showed 0 for me. Is there a different counter for descriptor-based MMA reads on SM100?

Repro:

Minimal self-contained repro (~400 lines, depends only on CUTLASS headers): [attached]

nvcc -std=c++20 -O2 --generate-code=arch=compute_100a,code=[sm_100a] \
  -I cutlass/include -o swizzle_mma_repro swizzle_mma_repro.cu

Environment: CUDA 13.0, B200, CUTLASS 4.x

Repro file

scripts/swizzle_mma_repro.cu – self-contained, no FlashMLA or other external dependencies.

Do you get full shared memory performance? 128 bytes/SM/cycle?

Is there any hint in the documentation that some swizzle patterns should create bank conflicts?

Because the MMA can consume any of these layouts, you have flexibility in how the data is written to SMEM, and that is where bank conflicts can be avoided. The documentation may be referring to that.

As far as I can tell (and your observations confirm it), neither WGMMA nor tcgen05.mma SMEM reads cause bank conflicts: they always read 128B at a time across all 32 banks. SWIZZLE_NONE is a tiled layout in which each 8x16B tile occupies 128 consecutive bytes, so reading one 8x16B tile at a time is bank-conflict-free. The swizzle patterns exist to avoid bank conflicts when writing to SMEM (which is why there are different ones).

Is the 37-cycle delta you’re observing for SWIZZLE_NONE consistent? Does it change or stay constant as you vary iters?


@Nanodeoclus - yes I believe that’s correct

I got a response from our Nvidia dev tech:

All supported MMA swizzle modes (including no-swizzle, i.e. 8x16B interleaved) are bank conflict free both from MMA side (read) and TMA side (write), but using no-swizzle or smaller swizzle modes can reduce TMA achievable throughput.

That might explain the 37-cycle SWIZZLE_NONE delta as well.
