Unlocking Tensor Core Performance with Floating Point Emulation in cuBLAS

Originally published at: Unlocking Tensor Core Performance with Floating Point Emulation in cuBLAS | NVIDIA Technical Blog

NVIDIA CUDA-X math libraries provide the fundamental numerical building blocks that enable developers to deploy accelerated applications across multiple high-performance domains, including AI and scientific computing. cuBLAS is a CUDA-X math library that consists of a highly optimized collection of basic linear algebra subroutines for matrix and vector operations that are specifically tuned to get…

Yes, I can see that the BF16x9 matmul emulation algorithm is finally part of CUDA Toolkit 13 (Update 2). However, according to the cuBLAS docs, BF16x9 is only effective on devices with compute capability 10.x (the Blackwell chips, I believe).
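For context on what BF16x9 does, here is a minimal NumPy sketch of the idea (my own illustration, not cuBLAS's actual kernels or rounding behavior): each fp32 operand is split into three bf16-representable slices of 8 mantissa bits each, and the 3 × 3 = 9 cross products are accumulated in fp32, which is where the "x9" in the name comes from.

```python
import numpy as np

def truncate_to_bf16(x: np.ndarray) -> np.ndarray:
    """Zero the low 16 bits of each fp32 value, leaving a value exactly
    representable in bfloat16 (truncation for simplicity; real hardware
    typically uses round-to-nearest-even)."""
    return (x.view(np.uint32) & np.uint32(0xFFFF0000)).view(np.float32)

def split_bf16x3(x: np.ndarray):
    """Split an fp32 matrix into three bf16-representable slices whose sum
    recovers the original 24-bit mantissa: roughly 8 + 8 + 8 bits."""
    hi = truncate_to_bf16(x)
    mid = truncate_to_bf16(x - hi)
    lo = truncate_to_bf16(x - hi - mid)
    return hi, mid, lo

def bf16x9_matmul(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Emulate an fp32 GEMM via 9 products of bf16-representable inputs,
    accumulated in fp32. Each product of two 8-bit-mantissa inputs fits
    exactly in fp32, mimicking a bf16 MMA with fp32 accumulation."""
    a_parts = split_bf16x3(np.asarray(a, dtype=np.float32))
    b_parts = split_bf16x3(np.asarray(b, dtype=np.float32))
    acc = np.zeros((a.shape[0], b.shape[1]), dtype=np.float32)
    for ai in a_parts:
        for bj in b_parts:
            acc += ai @ bj
    return acc
```

The accuracy win comes from the split being (nearly) lossless: `hi + mid + lo` reconstructs the fp32 input, so the 9-term sum approximates the full-precision product far better than a single bf16 matmul, while each of the 9 products can still run at bf16 tensor-core rate.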

Questions:

  • Is there a way to achieve similar FP32 matmul emulation on 8.x GPUs such as the A100?
  • By any chance, do you have any heat-map results for FP64 matmul on Ampere GPUs?