Accelerating AI Training with NVIDIA TF32 Tensor Cores

Originally published at: https://developer.nvidia.com/blog/accelerating-ai-training-with-tf32-tensor-cores/

The NVIDIA Ampere GPU architecture introduced third-generation Tensor Cores, with a new TensorFloat-32 (TF32) mode for accelerating FP32 convolutions and matrix multiplications. TF32 mode is the default option for AI training with 32-bit variables on the Ampere architecture. It brings Tensor Core acceleration to single-precision DL workloads, without needing any changes to model…
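For concreteness, here is a minimal sketch of an FP32 GEMM routed through Tensor Cores by opting the cuBLAS handle into TF32 math with cublasSetMathMode. The matrix sizes and fill values are placeholders and error checking is omitted; defaults for the math mode can differ across CUDA/cuBLAS versions, so treat this as an illustration rather than a complete training setup.

```cpp
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    const int m = 1024, n = 1024, k = 1024;

    // Host buffers with placeholder data (illustrative values only).
    std::vector<float> hA(m * k, 1.0f), hB(k * n, 1.0f), hC(m * n, 0.0f);

    float *dA, *dB, *dC;
    cudaMalloc(&dA, sizeof(float) * m * k);
    cudaMalloc(&dB, sizeof(float) * k * n);
    cudaMalloc(&dC, sizeof(float) * m * n);
    cudaMemcpy(dA, hA.data(), sizeof(float) * m * k, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), sizeof(float) * k * n, cudaMemcpyHostToDevice);
    cudaMemcpy(dC, hC.data(), sizeof(float) * m * n, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    // Opt in to TF32 Tensor Core math for FP32 GEMMs on Ampere GPUs.
    cublasSetMathMode(handle, CUBLAS_TF32_TENSOR_OP_MATH);

    const float alpha = 1.0f, beta = 0.0f;
    // C = alpha * A * B + beta * C; inputs and outputs stay in FP32,
    // while the multiply internally rounds operands to TF32 precision.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                m, n, k, &alpha, dA, m, dB, k, &beta, dC, m);

    cudaMemcpy(hC.data(), dC, sizeof(float) * m * n, cudaMemcpyDeviceToHost);
    printf("C[0] = %f\n", hC[0]);

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```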

The open source CUTLASS library (https://github.com/NVIDIA/cutlass, CUDA Templates for Linear Algebra Subroutines) covers the details of TF32 (numeric representation, rounding, math operations, and so on) and includes TF32 GEMM/convolution source code.
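As a rough illustration of the numeric representation (not CUTLASS's actual conversion routine), the hypothetical helper below emulates TF32 rounding on the host: TF32 keeps FP32's sign bit and 8-bit exponent but only 10 explicit mantissa bits, so the low 13 mantissa bits are rounded away here with a simple round-to-nearest-even scheme. Special values such as NaN and Inf are not handled.

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>

// Illustrative host-side emulation of rounding an FP32 value to TF32
// precision (1 sign, 8 exponent, 10 mantissa bits). Round-to-nearest-even
// at bit 13; NaN/Inf and exponent overflow are not handled. See CUTLASS
// for the conversion actually used on the GPU.
float round_to_tf32(float x) {
    uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));
    uint32_t round_bit = (bits >> 13) & 1u;  // ties-to-even adjustment
    bits += 0x0FFFu + round_bit;             // round at the 13th bit
    bits &= ~0x1FFFu;                        // clear the 13 dropped bits
    float y;
    std::memcpy(&y, &bits, sizeof(y));
    return y;
}

int main() {
    float x = 1.0f / 3.0f;
    printf("fp32: %.9f  tf32: %.9f\n", x, round_to_tf32(x));
    return 0;
}
```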