# Why do WMMA and MMA support different matrix tile sizes?

I want to do 8-bit matrix multiplications with tensor cores on an A6000. According to the PTX guide, the WMMA instructions support `m16n16k16`, `m8n32k16`, and `m32n8k16` for 8-bit matrix multiplication.

```
// Integer (.u8/.s8 multiplicands) wmma.mma
wmma.mma.sync.aligned.alayout.blayout.shape.s32.atype.btype.s32{.satfinite} d, a, b, c;

.alayout = {.row, .col};
.blayout = {.row, .col};
.shape  =  {.m16n16k16, .m8n32k16, .m32n8k16};
.dtype   = {.f16, .f32};
.atype   = {.s8, .u8};
.btype   = {.s8, .u8};
.ctype   = {.f16, .f32};
```

Meanwhile, the MMA instructions support `m8n8k16`, `m16n8k16`, `m16n8k32` for 8-bit matrix multiplication.

```
mma.sync.aligned.shape.row.col{.satfinite}.s32.atype.btype.s32 d, a, b, c;

.shape   = {.m8n8k16, .m16n8k16, .m16n8k32}
.atype   = {.u8, .s8};
.btype   = {.u8, .s8};
```

Why do they support completely different matrix size configurations? What's the difference between WMMA and MMA anyway? Which configuration should I use, if I have m = 32, n = 256, and k = 256?

I don't have an exact answer for you. It is self-evident that the tensor core unit(s) have become more capable, and therefore more complicated, as architectural generations progressed (7.x → 8.x → 9.x etc.). If you go back to the CUDA 9.0 PTX guide, I think you will not find mma and wgmma variants, just the wmma variant. So I suspect that as the TC units became more complicated, whatever formatting, syntax, and semantic meanings already "standardized" were insufficient, and so a new instruction type was exposed (mma, and then later wgmma).

Some differences between wmma and mma are described here.

None of the instruction variants directly support much larger sizes like m=32, n=256, k=256. So you will need to decompose that problem, if you intend to roll your own. You might also be interested in CUTLASS.
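To make "decompose that problem" concrete, here is a minimal CPU-side sketch of the tiling, not tensor-core code. The names `TileShape`, `tile_steps`, `naive_gemm_s8`, and `tiled_gemm_s8` are hypothetical and chosen only for illustration. The sketch assumes row-major A and column-major B (mirroring the `.row.col` form of `mma`), s8 inputs, and s32 accumulation. It counts how many tile-level multiply-accumulate steps a given shape needs, and walks the problem tile by tile in the order a kernel might.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper type: the m/n/k extents of one tile-level multiply-accumulate.
struct TileShape {
    int m, n, k;
};

// Number of tile-level multiply-accumulate steps needed to cover an
// M x N x K problem, or -1 if the tile does not divide it evenly
// (in which case the operands would need padding).
long tile_steps(int M, int N, int K, TileShape t) {
    if (M % t.m != 0 || N % t.n != 0 || K % t.k != 0) return -1;
    return static_cast<long>(M / t.m) * (N / t.n) * (K / t.k);
}

// Plain triple-loop reference: C = A * B with s8 inputs and s32 accumulation.
// Assumed layout: A is row-major (M x K), B is column-major (K x N),
// C is row-major (M x N).
std::vector<std::int32_t> naive_gemm_s8(const std::vector<std::int8_t>& A,
                                        const std::vector<std::int8_t>& B,
                                        int M, int N, int K) {
    std::vector<std::int32_t> C(static_cast<std::size_t>(M) * N, 0);
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < N; ++j)
            for (int k = 0; k < K; ++k)
                C[i * N + j] += std::int32_t(A[i * K + k]) * std::int32_t(B[j * K + k]);
    return C;
}

// Same product, computed tile by tile. Each iteration of the k0 loop stands in
// for one tile-level instruction: it accumulates a (t.m x t.k) tile of A times
// a (t.k x t.n) tile of B into a (t.m x t.n) tile of C.
std::vector<std::int32_t> tiled_gemm_s8(const std::vector<std::int8_t>& A,
                                        const std::vector<std::int8_t>& B,
                                        int M, int N, int K, TileShape t) {
    // This sketch does not pad: the tile must divide the problem evenly.
    assert(M % t.m == 0 && N % t.n == 0 && K % t.k == 0);
    std::vector<std::int32_t> C(static_cast<std::size_t>(M) * N, 0);
    for (int m0 = 0; m0 < M; m0 += t.m)
        for (int n0 = 0; n0 < N; n0 += t.n)
            for (int k0 = 0; k0 < K; k0 += t.k)
                for (int i = 0; i < t.m; ++i)
                    for (int j = 0; j < t.n; ++j) {
                        std::int32_t acc = C[(m0 + i) * N + (n0 + j)];
                        for (int kk = 0; kk < t.k; ++kk)
                            acc += std::int32_t(A[(m0 + i) * K + (k0 + kk)]) *
                                   std::int32_t(B[(n0 + j) * K + (k0 + kk)]);
                        C[(m0 + i) * N + (n0 + j)] = acc;
                    }
    return C;
}
```

For m = 32, n = 256, k = 256, `tile_steps` gives 512 steps for each of the three WMMA shapes and for `m16n8k32`, 1024 for `m16n8k16`, and 2048 for `m8n8k16`, so every shape listed above divides this problem evenly. On a GPU, each innermost tile block would be a single warp-level instruction, and the output tiles would be distributed across warps rather than visited in a serial loop.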

