Accelerating AI Training with NVIDIA TF32 Tensor Cores

Originally published on the NVIDIA Developer Blog.

The NVIDIA Ampere GPU architecture introduced the third generation of Tensor Cores, along with the new TensorFloat-32 (TF32) mode for accelerating FP32 convolutions and matrix multiplications. TF32 is the default mode for AI training with 32-bit variables on the Ampere architecture. It brings Tensor Core acceleration to single-precision deep learning workloads, without needing any changes to model…
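In deep learning frameworks, TF32 is typically controlled by a global switch rather than by model changes. As a minimal sketch, assuming PyTorch (1.7 or later), whose backend flags expose this choice:

```python
import torch

# Allow TF32 Tensor Core math for FP32 work on Ampere (and later) GPUs.
# Defaults have varied across PyTorch releases, so setting the flags
# explicitly documents intent.
torch.backends.cuda.matmul.allow_tf32 = True  # cuBLAS matrix multiplications
torch.backends.cudnn.allow_tf32 = True        # cuDNN convolutions

# Setting either flag to False falls back to full FP32 math for that backend:
# torch.backends.cuda.matmul.allow_tf32 = False
```

No kernels, layers, or optimizer code change; the flags only select which math path the CUDA libraries use for FP32 operands.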

The open source CUTLASS library (https://github.com/NVIDIA/cutlass, CUDA Templates for Linear Algebra Subroutines) covers the details of TF32 (numeric representation, rounding, and math operations) and includes TF32 GEMM and convolution source code.
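To make the numeric behavior concrete, here is a pure-Python sketch of TF32 rounding and a TF32-style dot product: inputs are rounded to TF32's 10 explicit mantissa bits, then products are accumulated at higher precision (FP32 on the hardware; Python floats here). The round-to-nearest-even choice and the helper names are illustrative assumptions; CUTLASS is the authoritative reference for the actual rounding and accumulation rules.

```python
import struct

def round_to_tf32(x: float) -> float:
    """Round an FP32 value to TF32 precision.

    TF32 keeps FP32's sign bit and 8-bit exponent but shortens the
    mantissa from 23 to 10 explicit bits. This sketch assumes
    round-to-nearest-even on the 13 dropped bits; special values
    (NaN, Inf) are not handled.
    """
    # Reinterpret the FP32 bit pattern as an integer.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    drop = 13                       # 23 - 10 mantissa bits are dropped
    half = 1 << (drop - 1)          # halfway point of the dropped field
    low = bits & ((1 << drop) - 1)  # the bits being discarded
    bits &= ~((1 << drop) - 1)      # truncate the mantissa
    # Round to nearest; on a tie, round so the kept LSB is even.
    if low > half or (low == half and (bits >> drop) & 1):
        bits += 1 << drop
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]

def tf32_dot(a, b):
    """Emulate a TF32 Tensor Core dot product: each input is rounded to
    TF32, then products are accumulated at higher precision (emulated
    here with Python floats standing in for FP32 accumulation)."""
    return sum(round_to_tf32(x) * round_to_tf32(y) for x, y in zip(a, b))
```

For example, `round_to_tf32(1.0 + 2**-11)` returns `1.0` (the `2**-11` bit falls below TF32's 10-bit mantissa and the tie rounds to even), while `1.0 + 2**-10` is representable and survives unchanged.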