I am trying to get the peak theoretical FP32 FLOP/s of the Jetson AGX Orin.
The documentation mentions 170 sparse INT8 TOP/s, from which I get 85 dense INT8 TOP/s by halving, and then roughly 42 FP16 TFLOP/s and 21 FP32 TFLOP/s by halving again at each precision step. So is the peak performance 21 TFLOP/s for FP32?
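For reference, here is the halving arithmetic I am assuming (the 2x factor per step is my assumption, not something I found in the datasheet):

```python
# Back-of-the-envelope halving chain (the per-step 2x factors
# are my assumption, not taken from NVIDIA documentation).
sparse_int8_tops = 170.0

dense_int8_tops = sparse_int8_tops / 2  # drop 2:4 sparsity -> 85 TOPS
fp16_tflops = dense_int8_tops / 2       # INT8 -> FP16     -> 42.5 TFLOP/s
fp32_tflops = fp16_tflops / 2           # FP16 -> FP32     -> 21.25 TFLOP/s

print(f"dense INT8 : {dense_int8_tops} TOP/s")
print(f"FP16       : {fp16_tflops} TFLOP/s")
print(f"FP32       : {fp32_tflops} TFLOP/s")
```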
Even if I use FP32 matmuls in TensorRT or PyTorch, will these automatically be executed on Tensor Cores, since the Tensor Cores support TF32?
I ask because I see another source mentioning 5.3 FP32 TFLOP/s for the CUDA cores, and I want to know which of these figures applies to deep learning workloads. Thanks!
Hi,
Tensor Cores require INT8 or FP16 precision.
You can find some details below:
You can also try our CUTLASS library to run a benchmark.
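If CUTLASS is not convenient, a rough PyTorch timing loop can give an indicative number as well. Below is a minimal sketch (our assumption: a large square matmul gets reasonably close to peak; the CUTLASS profiler is more precise):

```python
import torch

# Rough matmul throughput benchmark. Indicative only; assumes a
# CUDA-capable Jetson with PyTorch installed.
def bench_matmul(dtype, n=4096, iters=50):
    a = torch.randn(n, n, device="cuda", dtype=dtype)
    b = torch.randn(n, n, device="cuda", dtype=dtype)
    # Warm up so cuBLAS heuristics and GPU clocks settle.
    for _ in range(5):
        a @ b
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        a @ b
    end.record()
    torch.cuda.synchronize()
    seconds = start.elapsed_time(end) / 1e3 / iters  # elapsed_time is in ms
    # One n x n matmul costs 2 * n^3 FLOPs.
    return 2 * n**3 / seconds / 1e12

print(f"FP16 : {bench_matmul(torch.float16):.1f} TFLOP/s")
print(f"FP32 : {bench_matmul(torch.float32):.1f} TFLOP/s")
```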
Thanks.
Accelerating AI Training with NVIDIA TF32 Tensor Cores | NVIDIA Technical Blog
This says “TF32 mode is the default option for AI training with 32-bit variables on Ampere GPU architecture. It brings Tensor Core acceleration to single-precision DL workloads, without needing any changes to model scripts.”
Doesn’t this mean that FP32 DL code will run on tensor cores?
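PyTorch also exposes explicit switches for this. A minimal sketch from the PyTorch API (what I am unsure about is whether these actually engage the Tensor Cores on Orin):

```python
import torch

# Allow FP32 matmuls and convolutions to run as TF32 on Tensor Cores
# (Ampere and later).
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# Equivalent matmul-side switch in newer PyTorch releases;
# "highest" keeps full FP32 instead.
torch.set_float32_matmul_precision("high")
```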
Hi,
Sorry for the earlier omission; TF32 can run on Tensor Cores.
But please note that TF32 and FP32 are different precisions: TF32 keeps the FP32 8-bit exponent but truncates the mantissa from 23 bits to 10 bits.
For example, the TensorRT documentation notes:
“As TensorRT chooses algorithms based on resources and performance, there is no guarantee that a layer will run on Tensor Core.”
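For completeness, TF32 usage in TensorRT is controlled by a builder flag (enabled by default on Ampere). A minimal sketch with the Python API:

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
config = builder.create_builder_config()

# TF32 is on by default on Ampere; clear the flag to force full-FP32 kernels.
config.clear_flag(trt.BuilderFlag.TF32)
# ...then build the engine as usual. Layer placement is still decided
# by the builder, per the documentation quoted above.
```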
Thanks.