cuBLASLt FP8 SASS instruction QMMA


What is the QMMA instruction? I cannot find it in the CUDA Binary Utilities 12.2 documentation.
BTW, how can I write CUDA code (C++ or PTX) that uses FP8 matrix multiply-accumulate on Ada GPUs? I cannot find any clue about this in the PTX 8.2 documentation.

This instruction is not publicly documented anywhere and is accessible only indirectly, through some of NVIDIA’s pre-packaged software. Your guess is as good as mine as to why that might be the case. As noted in the question, the latest PTX documentation does not mention any MMA-type instructions with fp8 support at all, so CUDA programmers cannot write code for that right now.

Presumably the mnemonic QMMA stands for quarter-precision matrix multiply-accumulate. A reasonable hypothesis is that the suffix .f32.e4m3.e4m3 specifies that fp8 (quarter-precision) elements in the E4M3 format (as opposed to the E5M2 format) are multiplied and the products accumulated in an fp32 (single-precision) sum. The rest of the suffixes probably describes the organization of the matrices, so .16832 may stand for m=16, n=8, k=32.
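To make that hypothesis concrete, here is a minimal host-side C++ sketch of what such an operation would compute if the reading above is right: a 16x8x32 multiply-accumulate with E4M3 inputs and an fp32 accumulator. The decoding follows the commonly published E4M3 layout (1 sign bit, 4 exponent bits, 3 mantissa bits, bias 7, no infinities). The names e4m3_to_float and reference_qmma are made up for illustration, and the matrix layouts and the summation order are assumptions. Nothing here is taken from NVIDIA documentation, and the hardware's internal rounding may well differ.

```cpp
#include <cmath>
#include <cstdint>

// Decode one E4M3 byte (1 sign, 4 exponent, 3 mantissa bits, bias 7) to float.
// E4M3 has no infinities; exponent 15 with mantissa 7 encodes NaN.
float e4m3_to_float(uint8_t bits) {
    const int sign = (bits >> 7) & 0x1;
    const int exponent = (bits >> 3) & 0xF;
    const int mantissa = bits & 0x7;
    float value;
    if (exponent == 0xF && mantissa == 0x7) {
        value = std::nanf("");
    } else if (exponent == 0) {
        // Subnormal: (mantissa / 8) * 2^(1 - 7) = mantissa * 2^-9
        value = std::ldexp(static_cast<float>(mantissa), -9);
    } else {
        // Normal: (1 + mantissa / 8) * 2^(exponent - 7) = (8 + mantissa) * 2^(exponent - 10)
        value = std::ldexp(static_cast<float>(8 + mantissa), exponent - 10);
    }
    return sign ? -value : value;
}

// Assumed shape, from reading .16832 as m=16, n=8, k=32.
constexpr int M = 16, N = 8, K = 32;

// Hypothetical scalar reference: D = A * B + C with fp8 inputs, fp32 accumulation.
// Assumed layouts (for illustration only): A is M x K row-major,
// B is K x N column-major, C and D are M x N row-major.
void reference_qmma(const uint8_t* a, const uint8_t* b, const float* c, float* d) {
    for (int i = 0; i < M; ++i) {
        for (int j = 0; j < N; ++j) {
            float acc = c[i * N + j];
            for (int k = 0; k < K; ++k) {
                acc += e4m3_to_float(a[i * K + k]) * e4m3_to_float(b[j * K + k]);
            }
            d[i * N + j] = acc;
        }
    }
}
```

As a quick sanity check, filling A with 0x38 (1.0), B with 0x40 (2.0) and C with 0.5 should give 64.5 in every element of D, since each output is 32 products of 2.0 plus 0.5. A reference like this is only useful for comparing against results from the pre-packaged libraries; it does not give access to the instruction itself.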


There is currently no description of fp8 MMA in either the NVIDIA PTX or the SASS documentation. One thing you could try is to work backwards from the fp8 matmul implementation in cuBLAS and see how it would map onto PTX. Maybe you can find a PTX instruction of a form like wmma.mma.sync.aligned.alayout.blayout.shape.dtype.ctype d, a, b, c; and then use it from PTX, following the samples for other data types in NVIDIA's tensor-core cuda-samples.