What is the QMMA instruction? I cannot find it in the CUDA Binary Utilities 12.2 documentation.
BTW, how can I write CUDA code (C++ or PTX) to use FP8 Matrix Multiply and Accumulate on Ada GPUs? In the PTX 8.2 document I cannot find any clue about this.
This instruction is not publicly documented anywhere and is accessible only indirectly through some of NVIDIA's pre-packaged software. Your guess is as good as mine as to why that might be the case. As noted in the question, the latest PTX documentation does not mention any MMA-type instructions with fp8 support at all, so CUDA programmers cannot write code for that right now.
Presumably the mnemonic QMMA stands for quarter-precision matrix multiply-accumulate. A reasonable hypothesis is that .f32.e4m3.e4m3 specifies that fp8 (quarter-precision) elements using the E4M3 format (as opposed to the E5M2 format) are multiplied and the products accumulated into an fp32 (single-precision) sum. The balance of the suffixes probably describes the organization of the matrices, so .16832 may stand for m=16, n=8, k=32.
There is currently no description of fp8 MMA in either NVIDIA's PTX or SASS documentation. One option is to disassemble the fp8 matmul implementation in cuBLAS and inspect the PTX it was built from. You may find PTX instructions of the form wmma.mma.sync.aligned.alayout.blayout.shape.dtype.ctype d, a, b, c; and you can write PTX yourself, following the samples for other element types in NVIDIA's tensor-core cuda-samples.
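To make that suffix template concrete, here is the documented fp16 variant of warp-level mma.sync from the PTX ISA (not fp8 — that is the point: this is the existing suffix scheme a future fp8 variant would presumably extend; register names are placeholders):

```
// Per-warp D = A*B + C on a 16x8x16 tile, fp16 inputs, fp32 accumulate.
// Suffixes: shape (m16n8k16), A/B layouts (row/col), then dtype/atype/btype/ctype.
mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32
    { %f0, %f1, %f2, %f3 },      // D fragment: 4 x f32 per thread
    { %Ra0, %Ra1, %Ra2, %Ra3 },  // A fragment: 4 x .b32 regs, each packing 2 f16
    { %Rb0, %Rb1 },              // B fragment: 2 x .b32 regs, each packing 2 f16
    { %f4, %f5, %f6, %f7 };      // C fragment: 4 x f32 per thread
```

Swapping .f16.f16 for .e4m3.e4m3 (with an appropriate shape such as m16n8k32, since fp8 doubles the elements per register) is what one would guess an fp8 mma spelling to look like, but again, no such spelling is accepted by ptxas today.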