Does SMEM swizzle mode affect tcgen05.mma throughput? (SM100, fp8 SS MMA)

I’m trying to understand the purpose of SMEM swizzle patterns (SWIZZLE_NONE vs SWIZZLE_32B vs SWIZZLE_64B vs SWIZZLE_128B) for tcgen05.mma on Blackwell (SM100/B200).

I wrote a minimal benchmark that runs an fp8 SS MMA (M=64, N=64, K=128, kind::f8f6f4, cta_group::1) with four different SMEM descriptor layout_type settings: SWIZZLE_NONE, SWIZZLE_32B, SWIZZLE_64B, and SWIZZLE_128B. Each variant stores data in the correct physical format for its layout type, constructs valid descriptors, and produces correct numerical results (verified against CPU reference).

However, when I measure MMA-only throughput using device-side clock64() (excluding SMEM fill and TMEM readback), all four variants show identical performance:

| Swizzle mode | MMA cycles (100 iters) | Cycles/iter |
| --- | --- | --- |
| SWIZZLE_NONE | 12,960 | 129.6 |
| SWIZZLE_32B | 12,923 | 129.2 |
| SWIZZLE_64B | 12,923 | 129.2 |
| SWIZZLE_128B | 12,923 | 129.2 |

NCU profiling also shows 0 in l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld.sum for all four variants.

Questions:

  1. Does swizzle mode actually affect tcgen05.mma SMEM read throughput? The documentation says swizzle avoids SMEM bank conflicts, but I can’t observe any performance difference. Is it possible the tensor core’s SMEM read path doesn’t experience bank conflicts (or handles them internally)?
  2. Is the effect only visible at scale? My test uses a single CTA with one MMA stream. Would bank conflicts from swizzle mismatches only become visible in a production kernel where MMA competes with TMA/UTCCP for SMEM bandwidth?
  3. Are there NCU metrics that capture the tensor core’s SMEM read path? l1tex__data_bank_conflicts showed 0 for me. Is there a different counter for descriptor-based MMA reads on SM100?

Repro:

Minimal self-contained repro (~400 lines, depends only on CUTLASS headers): [attached]

nvcc -std=c++20 -O2 --generate-code=arch=compute_100a,code=[sm_100a] \
  -I cutlass/include -o swizzle_mma_repro swizzle_mma_repro.cu

Environment: CUDA 13.0, B200, CUTLASS 4.x

Repro file

scripts/swizzle_mma_repro.cu – self-contained, no FlashMLA or other external dependencies.

Do you get full shared memory performance? 128 bytes/SM/cycle?

Is there any hint in the documentation that some swizzle patterns should create bank conflicts?

Because the MMA can consume any of these layouts, you have flexibility in how the data is written to SMEM, and that is where bank conflicts can be avoided. The documentation may be referring to that.

As far as I can tell (and your observations confirm it), neither WGMMA nor tcgen05.mma SMEM reads cause bank conflicts: they always read 128B at a time across all 32 banks. SWIZZLE_NONE is a tiled layout in which each 8x16B tile occupies 128 consecutive bytes, so reading one 8x16B tile at a time is bank-conflict-free. The swizzle patterns exist to avoid bank conflicts when writing to SMEM (which is why there are different ones).

Is the 37-cycle delta you’re observing for SWIZZLE_NONE consistent? Does it change or stay constant as you vary iters?


@Nanodeoclus - yes I believe that’s correct

I got a response from our Nvidia dev tech:

All supported MMA swizzle modes (including no-swizzle, i.e. 8x16B interleaved) are bank conflict free both from MMA side (read) and TMA side (write), but using no-swizzle or smaller swizzle modes can reduce TMA achievable throughput.

That might explain the 37-cycle SWIZZLE_NONE delta as well.
