FP8/FP16 accumulation on Ada RTX 4090

This old post has been resolved by the updated whitepaper, which confirms that FP32 accumulation runs at half rate on GeForce. But does cuBLAS (cublasLtMatmul()) now support FP8 inputs with FP16 accumulation? The FP8 sample breaks when I change CUBLAS_COMPUTE_32F to CUBLAS_COMPUTE_16F (Line 71), returning “cuBLAS API failed with status 15” (CUBLAS_STATUS_NOT_SUPPORTED), so I assume it isn’t supported, but I want to make sure. A trimmed-down version of the change is shown below.

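For reference, this is roughly what I’m testing (a simplified sketch of the sample’s descriptor setup, not the exact code; error handling abbreviated):

```cpp
#include <cublasLt.h>
#include <cstdio>

// Trimmed-down sketch of the FP8 sample's operation-descriptor setup.
// The only change from the working configuration is the compute type.
void buildFp8OpDesc(cublasLtMatmulDesc_t *opDesc) {
    // Working: CUBLAS_COMPUTE_32F with CUDA_R_32F scale type.
    // Swapping in CUBLAS_COMPUTE_16F is the change that makes the sample
    // fail with status 15 (CUBLAS_STATUS_NOT_SUPPORTED) for me.
    cublasStatus_t st = cublasLtMatmulDescCreate(
        opDesc, CUBLAS_COMPUTE_16F /* was CUBLAS_COMPUTE_32F */, CUDA_R_32F);
    if (st != CUBLAS_STATUS_SUCCESS)
        std::printf("cuBLAS API failed with status %d\n", static_cast<int>(st));

    // FP8 kernels require the "TN" layout: A transposed, B non-transposed.
    cublasOperation_t transA = CUBLAS_OP_T, transB = CUBLAS_OP_N;
    cublasLtMatmulDescSetAttribute(*opDesc, CUBLASLT_MATMUL_DESC_TRANSA,
                                   &transA, sizeof(transA));
    cublasLtMatmulDescSetAttribute(*opDesc, CUBLASLT_MATMUL_DESC_TRANSB,
                                   &transB, sizeof(transB));
}
```
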
From the cublasLtMatmul() documentation:

To use FP8 kernels, the following set of requirements must be satisfied:

  • All matrix pointers must be 16-byte aligned.
  • A must be transposed and B non-transposed (The “TN” format).
  • The compute type must be CUBLAS_COMPUTE_32F.
  • The scale type must be CUDA_R_32F.

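Given the requirements quoted above, one way to check this programmatically (rather than relying on the runtime error) would be to ask cublasLtMatmulAlgoGetHeuristic() whether any kernel exists for FP8 inputs with a given compute type. This is only a sketch I haven’t fully validated; countFp8Algos and the choice of FP16 for the C/D types are my own illustration, not taken from the sample:

```cpp
#include <cublasLt.h>
#include <cstdint>
#include <cstdio>

// Sketch: query the cuBLASLt heuristic for FP8 inputs with a chosen compute
// type. Zero results (or a CUBLAS_STATUS_NOT_SUPPORTED return) suggests the
// input/accumulate combination is not supported on this GPU / cuBLAS build.
static int countFp8Algos(cublasLtHandle_t handle, cublasComputeType_t computeType,
                         int64_t m, int64_t n, int64_t k) {
    cublasLtMatmulDesc_t opDesc = nullptr;
    cublasLtMatrixLayout_t aDesc = nullptr, bDesc = nullptr, cDesc = nullptr, dDesc = nullptr;
    cublasLtMatmulPreference_t pref = nullptr;
    int returned = 0;

    // FP8 requires the TN layout; the docs also require the CUDA_R_32F scale type.
    if (cublasLtMatmulDescCreate(&opDesc, computeType, CUDA_R_32F) == CUBLAS_STATUS_SUCCESS) {
        cublasOperation_t transA = CUBLAS_OP_T, transB = CUBLAS_OP_N;
        cublasLtMatmulDescSetAttribute(opDesc, CUBLASLT_MATMUL_DESC_TRANSA, &transA, sizeof(transA));
        cublasLtMatmulDescSetAttribute(opDesc, CUBLASLT_MATMUL_DESC_TRANSB, &transB, sizeof(transB));

        // Column-major layouts: A is k x m (transposed), B is k x n; E4M3 inputs, FP16 C/D.
        cublasLtMatrixLayoutCreate(&aDesc, CUDA_R_8F_E4M3, k, m, k);
        cublasLtMatrixLayoutCreate(&bDesc, CUDA_R_8F_E4M3, k, n, k);
        cublasLtMatrixLayoutCreate(&cDesc, CUDA_R_16F, m, n, m);
        cublasLtMatrixLayoutCreate(&dDesc, CUDA_R_16F, m, n, m);

        cublasLtMatmulPreferenceCreate(&pref);
        uint64_t workspaceSize = 32ull * 1024 * 1024;
        cublasLtMatmulPreferenceSetAttribute(pref, CUBLASLT_MATMUL_PREF_MAX_WORKSPACE_BYTES,
                                             &workspaceSize, sizeof(workspaceSize));

        cublasLtMatmulHeuristicResult_t results[8];
        cublasStatus_t st = cublasLtMatmulAlgoGetHeuristic(
            handle, opDesc, aDesc, bDesc, cDesc, dDesc, pref, 8, results, &returned);
        if (st != CUBLAS_STATUS_SUCCESS)  // e.g. status 15 = CUBLAS_STATUS_NOT_SUPPORTED
            returned = 0;
    }

    if (pref)   cublasLtMatmulPreferenceDestroy(pref);
    if (dDesc)  cublasLtMatrixLayoutDestroy(dDesc);
    if (cDesc)  cublasLtMatrixLayoutDestroy(cDesc);
    if (bDesc)  cublasLtMatrixLayoutDestroy(bDesc);
    if (aDesc)  cublasLtMatrixLayoutDestroy(aDesc);
    if (opDesc) cublasLtMatmulDescDestroy(opDesc);
    return returned;
}

int main() {
    cublasLtHandle_t handle;
    cublasLtCreate(&handle);
    std::printf("FP8 + FP32 accumulate: %d algo(s)\n",
                countFp8Algos(handle, CUBLAS_COMPUTE_32F, 1024, 1024, 1024));
    std::printf("FP8 + FP16 accumulate: %d algo(s)\n",
                countFp8Algos(handle, CUBLAS_COMPUTE_16F, 1024, 1024, 1024));
    cublasLtDestroy(handle);
    return 0;
}
```

(Built with something like `nvcc check_fp8.cu -lcublasLt`; if the FP16-accumulate query returns zero algorithms, that would match the status-15 failure I’m seeing.)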