I’m trying to understand the purpose of SMEM swizzle patterns (SWIZZLE_NONE vs SWIZZLE_32B vs SWIZZLE_64B vs SWIZZLE_128B) for tcgen05.mma on Blackwell (SM100/B200).
I wrote a minimal benchmark that runs an fp8 SS MMA (M=64, N=64, K=128, kind::f8f6f4, cta_group::1) with four different SMEM descriptor layout_type settings: SWIZZLE_NONE, SWIZZLE_32B, SWIZZLE_64B, and SWIZZLE_128B. Each variant stores data in the correct physical format for its layout type, constructs valid descriptors, and produces correct numerical results (verified against CPU reference).
However, when I measure MMA-only throughput using device-side clock64() (excluding SMEM fill and TMEM readback), all four variants show identical performance:
| Swizzle Mode | MMA cycles (100 iters) | Cycles/iter |
|---|---|---|
| SWIZZLE_NONE | 12,960 | 129 |
| SWIZZLE_32B | 12,923 | 129 |
| SWIZZLE_64B | 12,923 | 129 |
| SWIZZLE_128B | 12,923 | 129 |
NCU profiling also shows 0 in l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld.sum for all four variants.
Questions:
- Does swizzle mode actually affect
tcgen05.mmaSMEM read throughput? The documentation says swizzle avoids SMEM bank conflicts, but I can’t observe any performance difference. Is it possible the tensor core’s SMEM read path doesn’t experience bank conflicts (or handles them internally)? - Is the effect only visible at scale? My test uses a single CTA with one MMA stream. Would bank conflicts from swizzle mismatches only become visible in a production kernel where MMA competes with TMA/UTCCP for SMEM bandwidth?
- Are there NCU metrics that capture the tensor core’s SMEM read path?
l1tex__data_bank_conflictsshowed 0 for me. Is there a different counter for descriptor-based MMA reads on SM100?
Repro:
Minimal self-contained repro (~400 lines, depends only on CUTLASS headers): [attached]
nvcc -std=c++20 -O2 --generate-code=arch=compute_100a,code=[sm_100a] \
-I cutlass/include -o swizzle_mma_repro swizzle_mma_repro.cu
Environment: CUDA 13.0, B200, CUTLASS 4.x
Repro file
scripts/swizzle_mma_repro.cu – self-contained, no FlashMLA or other external dependencies.