Hi there,
On a Blackwell RTX 5080 (sm_120a), a warp can run an m16n8k32 tensor-core MMA tile for FP8 (cutlass/include/cute/arch/mma_sm120.hpp at main · NVIDIA/cutlass · GitHub):
```cpp
asm volatile(
    // Block-scaled FP8 MMA: D = A*B + C, with UE8M0 scale factors
    "mma.sync.aligned.kind::mxf8f6f4.block_scale.scale_vec::1X.m16n8k32.row.col.f32.e4m3.e4m3.f32.ue8m0 "
    "{%0, %1, %2, %3}, "     // D: 4x f32
    "{%4, %5, %6, %7}, "     // A: 4x b32 (16x32 e4m3)
    "{%8, %9}, "             // B: 2x b32 (32x8 e4m3)
    "{%10, %11, %12, %13}, " // C: 4x f32
    "{%14}, "                // scale-A data
    "{%15, %16}, "           // scale-A byte-id, thread-id
    "{%17}, "                // scale-B data
    "{%18, %19};\n"          // scale-B byte-id, thread-id
    : "=f"(d0), "=f"(d1), "=f"(d2), "=f"(d3)
    : "r"(a0), "r"(a1), "r"(a2), "r"(a3),
      "r"(b0), "r"(b1),
      "f"(c0), "f"(c1), "f"(c2), "f"(c3),
      "r"(uint32_t(sfa0)), "h"(bidA), "h"(tidA),
      "r"(uint32_t(sfb0)), "h"(bidB), "h"(tidB));
```
I tried to use ldmatrix to load matrix B, whose tile is 8 x 32. However, the PTX ISA lists no suitable shape for this b8 case (1. Introduction — PTX ISA 8.7 documentation):

| .shape | Matrix shape | Element size |
|---|---|---|
| .m8n8 | 8x8 | 16-bit |
| .m16n16 | 16x16 | 8-bit or 6-bit or 4-bit |
| .m8n16 | 8x16 | 6-bit or 4-bit |

Any suggestions?
Thanks in advance!
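One idea I have been experimenting with (not sure it is the intended path) is to view each pair of K-adjacent e4m3 bytes as a single 16-bit element, so the 32x8 (KxN, K-major) B tile becomes a 16x8 b16 tile that ldmatrix.m8n8.x2 can fetch. Since the b8 B fragment of mma.m16n8k32 also packs 4 K-consecutive bytes per register, the per-thread fragments should line up. A minimal sketch; the function name, addressing, and the unswizzled-layout assumption are all mine and untested:

```cuda
// Hypothetical workaround: load a 32x8 K-major e4m3 B tile as a 16x8 b16 tile
// using ldmatrix.m8n8.x2. Assumes an unswizzled shared-memory tile with a
// leading dimension of 32 bytes (one full K column per n).
__device__ void load_b_frag(const void* smem_b, uint32_t& b0, uint32_t& b1) {
    int lane = threadIdx.x % 32;
    int r = lane % 16;  // ldmatrix .x2 takes row addresses from lanes 0..15
    // Lanes 0..7 address column n = r at k = 0..15 (first 8x8 b16 tile);
    // lanes 8..15 address column n = r - 8 at k = 16..31 (second tile).
    const char* p = static_cast<const char*>(smem_b)
                  + (r % 8) * 32   // select column n (32-byte K stride)
                  + (r / 8) * 16;  // select K half (16 bytes = 8 b16 elements)
    uint32_t addr = static_cast<uint32_t>(__cvta_generic_to_shared(p));
    asm volatile(
        "ldmatrix.sync.aligned.m8n8.x2.shared.b16 {%0, %1}, [%2];\n"
        : "=r"(b0), "=r"(b1)
        : "r"(addr));
}
```

I have not verified this against the fragment layout tables in the PTX ISA, so I may be missing a .trans or a different address pattern.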