Why do WMMA and MMA support different matrix tile sizes?

I want to do 8-bit matrix multiplications with tensor cores on an A6000. According to the PTX guide, the WMMA instructions support m16n16k16, m8n32k16, and m32n8k16 for 8-bit matrix multiplication.

// Integer (.u8/.s8 multiplicands) wmma.mma
wmma.mma.sync.aligned.alayout.blayout.shape.s32.atype.btype.s32{.satfinite} d, a, b, c;

.alayout = {.row, .col};
.blayout = {.row, .col};
.shape  =  {.m16n16k16, .m8n32k16, .m32n8k16};
.atype   = {.s8, .u8};
.btype   = {.s8, .u8};

Meanwhile, the MMA instructions support m8n8k16, m16n8k16, and m16n8k32 for 8-bit matrix multiplication.

mma.sync.aligned.shape.row.col{.satfinite}.s32.atype.btype.s32 d, a, b, c;

.shape   = {.m8n8k16, .m16n8k16, .m16n8k32}
.atype   = {.u8, .s8};
.btype   = {.u8, .s8};

Why do they support completely different matrix size configurations? What’s the difference between WMMA and MMA anyway? Which configuration should I use, if I have m = 32, n = 256, and k = 256?

I don’t have an exact answer for you. The tensor core units have clearly become more capable, and therefore more complicated, with each architectural generation (7.x → 8.x → 9.x, etc.). If you go back to the CUDA 9.0 PTX guide, I think you will find only the wmma variant, with no mma or wgmma variants. So I suspect that as the TC units became more complicated, the formatting, syntax, and semantics already “standardized” for wmma were insufficient, and so a new instruction type was exposed (mma, and then later wgmma).

Some differences between wmma and mma are described here.

None of the instruction variants directly supports sizes as large as m=32, n=256, k=256. So you will need to decompose that problem into tiles of a supported shape, accumulating partial products along k, if you intend to roll your own. You might also be interested in CUTLASS.
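
To make the decomposition concrete, here is a minimal host-side sketch in plain C++. It is only a CPU emulation under stated assumptions, not tensor core code: TileShape, tile_ops, tile_mma, and tiled_gemm are hypothetical names invented for this illustration, and tile_mma merely stands in for one warp-wide wmma/mma instruction. A is assumed row-major and B column-major, mirroring the .row.col layout in the mma syntax above, and the problem dimensions are assumed to be exact multiples of the tile shape.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Shape of one tile-level multiply-accumulate, e.g. {16, 8, 32} for m16n8k32.
struct TileShape { int m, n, k; };

// Number of tile operations needed to cover an M x N x K problem.
// Assumes every dimension is an exact multiple of the tile dimension.
inline long tile_ops(int M, int N, int K, TileShape t) {
    return static_cast<long>(M / t.m) * (N / t.n) * (K / t.k);
}

// CPU stand-in for ONE tile instruction: C(m x n) += A(m x k) * B(k x n).
// A and C are row-major, B is column-major; lda/ldb/ldc are leading dimensions.
void tile_mma(const std::int8_t* A, int lda,
              const std::int8_t* B, int ldb,
              std::int32_t* C, int ldc, TileShape t) {
    for (int i = 0; i < t.m; ++i) {
        for (int j = 0; j < t.n; ++j) {
            std::int32_t acc = C[i * ldc + j];
            for (int p = 0; p < t.k; ++p) {
                acc += static_cast<std::int32_t>(A[i * lda + p]) *
                       static_cast<std::int32_t>(B[j * ldb + p]);
            }
            C[i * ldc + j] = acc;
        }
    }
}

// Full GEMM assembled from tiles: C (M x N, row-major) = A (M x K, row-major)
// times B (K x N, column-major). The innermost loop walks along k so that each
// output tile accumulates K / t.k partial products.
std::vector<std::int32_t> tiled_gemm(const std::vector<std::int8_t>& A,
                                     const std::vector<std::int8_t>& B,
                                     int M, int N, int K, TileShape t) {
    assert(M % t.m == 0 && N % t.n == 0 && K % t.k == 0);
    std::vector<std::int32_t> C(static_cast<std::size_t>(M) * N, 0);
    for (int i0 = 0; i0 < M; i0 += t.m) {
        for (int j0 = 0; j0 < N; j0 += t.n) {
            for (int k0 = 0; k0 < K; k0 += t.k) {
                tile_mma(&A[i0 * K + k0], K,   // tile of A at (i0, k0)
                         &B[j0 * K + k0], K,   // tile of B at (k0, j0)
                         &C[i0 * N + j0], N,   // tile of C at (i0, j0)
                         t);
            }
        }
    }
    return C;
}
```

For m = 32, n = 256, k = 256, tile_ops gives 512 tile operations for each of the three wmma shapes and for m16n8k32, 1024 for m16n8k16, and 2048 for m8n8k16. The instruction count alone does not say which shape is fastest on a given GPU; that depends on how the tiles are loaded and scheduled, which is the part CUTLASS handles for you.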
