I want to do 8-bit matrix multiplications with tensor cores on an A6000. According to the PTX ISA guide, the WMMA instructions support m16n16k16, m8n32k16, and m32n8k16 for 8-bit matrix multiplication:
// Integer (.u8/.s8 multiplicands) wmma.mma
wmma.mma.sync.aligned.alayout.blayout.shape.s32.atype.btype.s32{.satfinite} d, a, b, c;
.alayout = {.row, .col};
.blayout = {.row, .col};
.shape = {.m16n16k16, .m8n32k16, .m32n8k16};
.atype = {.s8, .u8};
.btype = {.s8, .u8};
// d and c are fixed to .s32 in the integer variant
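For reference, the m16n16k16 s8 shape is also exposed through the CUDA C++ WMMA API (`nvcuda::wmma`), which compiles down to these PTX instructions. A minimal sketch, assuming a single warp computing one 16x16 tile with row-major A and col-major B (the kernel name `tile_gemm_s8` is my own):

```cuda
#include <mma.h>
#include <cstdint>
using namespace nvcuda;

// Sketch: one warp computes a single 16x16 s32 output tile from
// 16x16 s8 A and B tiles. Assumes leading dimensions of exactly 16.
__global__ void tile_gemm_s8(const int8_t *a, const int8_t *b, int32_t *d) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, signed char, wmma::row_major> fa;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, signed char, wmma::col_major> fb;
    wmma::fragment<wmma::accumulator, 16, 16, 16, int32_t> fc;

    wmma::fill_fragment(fc, 0);          // C = 0
    wmma::load_matrix_sync(fa, a, 16);   // load A tile (ld = 16)
    wmma::load_matrix_sync(fb, b, 16);   // load B tile (ld = 16)
    wmma::mma_sync(fc, fa, fb, fc);      // D = A * B + C
    wmma::store_matrix_sync(d, fc, 16, wmma::mem_row_major);
}
```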
Meanwhile, the MMA instructions support m8n8k16, m16n8k16, and m16n8k32 for 8-bit matrix multiplication:
mma.sync.aligned.shape.row.col{.satfinite}.s32.atype.btype.s32 d, a, b, c;
.shape = {.m8n8k16, .m16n8k16, .m16n8k32};
.atype = {.u8, .s8};
.btype = {.u8, .s8};
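From what I can tell, mma has no fragment-level load/store helpers like wmma, so you issue it from inline PTX with per-thread registers. A hedged sketch for m16n8k32 with s8 operands, using the per-thread register counts the PTX ISA documents for this shape (4 b32 registers for A, 2 for B, 4 s32 accumulators); populating those registers correctly (e.g. via ldmatrix) is omitted:

```cuda
#include <cstdint>

// Sketch: each of the 32 threads in a warp holds its slice of the
// 16x32 s8 A tile (4 x b32), the 32x8 s8 B tile (2 x b32), and the
// 16x8 s32 accumulator tile (4 x s32). D accumulates in place.
__device__ void mma_m16n8k32_s8(uint32_t a[4], uint32_t b[2], int32_t d[4]) {
    asm volatile(
        "mma.sync.aligned.m16n8k32.row.col.s32.s8.s8.s32 "
        "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%0,%1,%2,%3};"
        : "+r"(d[0]), "+r"(d[1]), "+r"(d[2]), "+r"(d[3])
        : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
          "r"(b[0]), "r"(b[1]));
}
```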
Why do they support completely different shape configurations? What’s the difference between WMMA and MMA anyway? And which configuration should I use if I have m = 32, n = 256, and k = 256?
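For what it’s worth, my problem dimensions divide evenly into all of the shapes above; a quick arithmetic check (the helper `tile_count` is my own):

```python
def tile_count(m, n, k, shape):
    """Number of (tm, tn, tk) tiles needed to cover an m x n x k GEMM."""
    tm, tn, tk = shape
    assert m % tm == 0 and n % tn == 0 and k % tk == 0
    return (m // tm) * (n // tn) * (k // tk)

M, N, K = 32, 256, 256
print(tile_count(M, N, K, (16, 16, 16)))  # wmma m16n16k16 -> 512 tiles
print(tile_count(M, N, K, (16, 8, 32)))   # mma  m16n8k32  -> 512 tiles
```

Both shapes give the same tile count here, since each covers 4096 multiply-accumulates per tile.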