cuBLASLt FP8 SASS instruction QMMA

What is the QMMA instruction? I cannot find it in the CUDA Binary Utilities 12.2 documentation.
BTW, how can I write CUDA code (C++ or PTX) to use FP8 matrix multiply-and-accumulate on Ada GPUs? I cannot find any clue about this in the PTX ISA 8.2 document.

This instruction is not publicly documented anywhere and is accessible only indirectly, through some of NVIDIA's pre-packaged software. Your guess is as good as mine as to why that might be the case. As noted in the question, the latest PTX documentation does not mention any MMA-type instructions with fp8 support at all, so CUDA programmers cannot write code for that right now.

Presumably the mnemonic QMMA stands for quarter-precision matrix multiply-accumulate. A reasonable hypothesis is that .f32.e4m3.e4m3 specifies that fp8 (quarter-precision) elements in the E4M3 format (as opposed to the E5M2 format) are multiplied and the products accumulated into an fp32 (single-precision) sum. The balance of the suffixes probably describes the organization of the matrices, so .16832 may stand for m=16, n=8, k=32.
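For comparison, an MMA with the same m16n8k32 shape is already documented in the PTX ISA for int8 operands and is usable from CUDA via inline assembly. If fp8 support eventually follows the same pattern, a hypothetical E4M3 variant would presumably just swap the operand types. Below is a sketch: the int8 function is the real, documented instruction (sm_80+), while the fp8 mnemonic in the trailing comment is a guess by analogy and will not assemble with current toolchains.

```cuda
#include <cstdint>

// Documented int8 MMA: m=16, n=8, k=32, s32 accumulate (sm_80+).
// Per the PTX ISA fragment layout for m16n8k32, each thread of the
// warp holds: A = 4 x .b32 (16 packed s8), B = 2 x .b32 (8 packed s8),
// C/D = 4 x s32.
__device__ void mma_s8_16832(int32_t d[4], const uint32_t a[4],
                             const uint32_t b[2], const int32_t c[4]) {
    asm volatile(
        "mma.sync.aligned.m16n8k32.row.col.s32.s8.s8.s32 "
        "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};\n"
        : "=r"(d[0]), "=r"(d[1]), "=r"(d[2]), "=r"(d[3])
        : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
          "r"(b[0]), "r"(b[1]),
          "r"(c[0]), "r"(c[1]), "r"(c[2]), "r"(c[3]));
}

// Hypothetical fp8 counterpart, by analogy only -- NOT in PTX 8.2:
//   mma.sync.aligned.m16n8k32.row.col.f32.e4m3.e4m3.f32
// with A/B fragments packing four E4M3 bytes per .b32 register and
// C/D held as 4 x f32 per thread.
```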


There is currently no description of fp8 MMA in either the NVIDIA PTX or SASS documentation. One option is to disassemble the fp8 matmul implementation in cuBLAS and inspect the instructions it emits; from there you might be able to infer a corresponding PTX form (the usual d, a, b, c operand pattern). To write your own tensor-core code, you can follow the samples for other data types in NVIDIA's cuda-samples repository.
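That inspection can be done with cuobjdump from CUDA Binary Utilities, dumping the SASS of the cuBLASLt library and searching for the QMMA mnemonic. A minimal sketch; the library path is an assumption and depends on your CUDA install:

```shell
# Dump SASS for all cubins embedded in cuBLASLt
# (path below is an assumption; adjust for your installation)
cuobjdump --dump-sass /usr/local/cuda/lib64/libcublasLt.so > cublaslt.sass

# Search the disassembly for the undocumented fp8 instruction
grep -m 5 "QMMA" cublaslt.sass
```

Note that the library is large, so the dump can take a while and produce a very big text file; restricting cuobjdump to a specific architecture with `--gpu-architecture sm_89` keeps it to the Ada kernels.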