Why do WMMA and MMA support different matrix tile sizes?

qingyao · October 28, 2023, 9:39pm

I want to do 8-bit matrix multiplications with tensor cores on an A6000. According to the PTX guide, the WMMA instructions support the shapes m16n16k16, m8n32k16, and m32n8k16 for 8-bit matrix multiplication.

// Integer (.u8/.s8 multiplicands) wmma.mma
wmma.mma.sync.aligned.alayout.blayout.shape.s32.atype.btype.s32{.satfinite} d, a, b, c;

.alayout = {.row, .col};
.blayout = {.row, .col};
.shape  =  {.m16n16k16, .m8n32k16, .m32n8k16};
.atype   = {.s8, .u8};
.btype   = {.s8, .u8};

Meanwhile, the MMA instructions support m8n8k16, m16n8k16, m16n8k32 for 8-bit matrix multiplication.

mma.sync.aligned.shape.row.col{.satfinite}.s32.atype.btype.s32 d, a, b, c;

.shape   = {.m8n8k16, .m16n8k16, .m16n8k32}
.atype   = {.u8, .s8};
.btype   = {.u8, .s8};

Why do they support completely different matrix size configurations? What’s the difference between WMMA and MMA anyway? Which configuration should I use, if I have m = 32, n = 256, and k = 256?

Robert_Crovella · October 28, 2023, 9:54pm

I don’t have an exact answer for you. The tensor core units have evidently become more capable, and therefore more complicated, as architectural generations progressed (7.x → 8.x → 9.x, etc.). If you go back to the CUDA 9.0 PTX guide, I think you will find only the wmma variant, with no mma or wgmma variants. So I suspect that as the TC units became more complicated, the formatting, syntax, and semantics already “standardized” for wmma became insufficient, and so new instruction types were exposed (mma, and later wgmma).

Some differences between wmma and mma are described here.

None of the instruction variants directly support much larger sizes like m=32, n=256, k=256. So you will need to decompose that problem, if you intend to roll your own. You might also be interested in CUTLASS.
