I want to do 8-bit matrix multiplications with tensor cores on an A6000. According to the PTX ISA guide, the WMMA instructions support the `m16n16k16`, `m8n32k16`, and `m32n8k16` shapes for 8-bit matrix multiplication.

```
// Integer (.u8/.s8 multiplicands) wmma.mma
wmma.mma.sync.aligned.alayout.blayout.shape.s32.atype.btype.s32{.satfinite} d, a, b, c;
.alayout = {.row, .col};
.blayout = {.row, .col};
.shape = {.m16n16k16, .m8n32k16, .m32n8k16};
.atype = {.s8, .u8};
.btype = {.s8, .u8};
```
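For context, here is roughly what I would write with the CUDA C++ WMMA API (`nvcuda::wmma` in `<mma.h>`) for the `m16n16k16` s8 shape. This is a minimal single-warp sketch; the kernel name and the leading dimensions of 16 are my own choices, and I have not actually run this:

```
#include <mma.h>
using namespace nvcuda;

// One warp computes a single 16x16 tile: D (s32) = A (s8) x B (s8) + C (s32).
__global__ void wmma_s8_16x16x16(const signed char *a, const signed char *b,
                                 const int *c, int *d) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, signed char, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, signed char, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, int> c_frag;

    wmma::load_matrix_sync(a_frag, a, 16);   // leading dimension 16
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::load_matrix_sync(c_frag, c, 16, wmma::mem_row_major);

    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);

    wmma::store_matrix_sync(d, c_frag, 16, wmma::mem_row_major);
}
```

(Integer WMMA needs sm_72 or newer; the A6000 is sm_86, so that should be fine.)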

Meanwhile, the MMA instructions support the `m8n8k16`, `m16n8k16`, and `m16n8k32` shapes for 8-bit matrix multiplication.

```
mma.sync.aligned.shape.row.col{.satfinite}.s32.atype.btype.s32 d, a, b, c;
.shape = {.m8n8k16, .m16n8k16, .m16n8k32};
.atype = {.u8, .s8};
.btype = {.u8, .s8};
```
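Unlike WMMA, `mma` takes per-thread register fragments directly. For `m16n8k32` with s8 operands the fragment sizes work out to 4 `.b32` registers for A, 2 for B, and 4 `.s32` registers for C/D per thread. A sketch of the inline-asm wrapper I would expect to write (register contents assumed to have been filled by `ldmatrix` or manual loads; the function name is my own):

```
// One m16n8k32 s8 MMA, accumulating in place into d0..d3.
__device__ void mma_s8_m16n8k32(int &d0, int &d1, int &d2, int &d3,
                                unsigned a0, unsigned a1, unsigned a2, unsigned a3,
                                unsigned b0, unsigned b1) {
    asm volatile(
        "mma.sync.aligned.m16n8k32.row.col.s32.s8.s8.s32 "
        "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%0,%1,%2,%3};\n"
        : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3)
        : "r"(a0), "r"(a1), "r"(a2), "r"(a3), "r"(b0), "r"(b1));
}
```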

Why do they support completely different matrix size configurations? What's the difference between WMMA and MMA anyway? Which configuration should I use if I have m = 32, n = 256, and k = 256?