What is the QMMA instruction? I cannot find it in the CUDA Binary Utilities 12.2 documentation.
BTW, how can I write CUDA code (C++ or PTX) to use FP8 Matrix Multiply and Accumulate on Ada GPUs? In the PTX 8.2 document I cannot find any clue about this.
This instruction is not publicly documented anywhere and is accessible only indirectly through some of NVIDIA's pre-packaged software. Your guess is as good as mine as to why that might be the case. As noted in the question, the latest PTX documentation does not mention any MMA-type instructions with fp8 support at all, so CUDA programmers cannot write code for that right now.
Presumably the mnemonic QMMA stands for quarter-precision matrix multiply-accumulate. A reasonable hypothesis is that .f32.e4m3.e4m3 specifies that fp8 (quarter-precision) elements using the E4M3 format (as opposed to the E5M2 format) are multiplied and the products accumulated into an fp32 (single-precision) sum. The balance of the suffixes probably describes the organization of the matrices, so .16832 may stand for m=16, n=8, k=32.
There is currently no description of fp8 MMA in either NVIDIA's PTX or SASS documentation. One option is to disassemble the fp8 matmul implementation in cuBLAS and inspect the PTX it was built from. You may find PTX instructions of the form wmma.mma.sync.aligned.alayout.blayout.shape.dtype.ctype d, a, b, c; and you can write PTX yourself, following the samples for other element types in NVIDIA's tensor-core cuda-samples.
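To make that suffix template concrete, here is the documented fp16 variant of warp-level mma.sync from the PTX ISA (not fp8 — that is the point: this is the existing suffix scheme a future fp8 variant would presumably extend; register names are placeholders):

```
// Per-warp D = A*B + C on a 16x8x16 tile, fp16 inputs, fp32 accumulate.
// Suffixes: shape (m16n8k16), A/B layouts (row/col), then dtype/atype/btype/ctype.
mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32
    { %f0, %f1, %f2, %f3 },      // D fragment: 4 x f32 per thread
    { %Ra0, %Ra1, %Ra2, %Ra3 },  // A fragment: 4 x .b32 regs, each packing 2 f16
    { %Rb0, %Rb1 },              // B fragment: 2 x .b32 regs, each packing 2 f16
    { %f4, %f5, %f6, %f7 };      // C fragment: 4 x f32 per thread
```

Swapping .f16.f16 for .e4m3.e4m3 (with an appropriate shape such as m16n8k32, since fp8 doubles the elements per register) is what one would guess an fp8 mma spelling to look like, but again, no such spelling is accepted by ptxas today.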