How to calculate the final number of flops using nvprof

Hi all,

In my last topic, I tried to calculate the number of flops when running YOLOv3 with TensorRT, using trtexec for the inference and nvprof for the number of flops per kernel. As I mentioned in that topic, there was a huge difference in the number of flops between the FP32 and FP16 implementations.

The main difference between the two precision modes is the use of Tensor Cores (h884cudnn kernels) for the convolution layers. I am wondering whether, for the final result, I have to multiply the number of flops of each kernel that runs on Tensor Cores by 8 × 8 × 4?
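To make the question concrete, here is a minimal sketch of how the per-kernel counters could be totalled before deciding on any scaling. It assumes the metrics were exported as CSV with the columns Device, Kernel, Invocations, Metric Name, Metric Description, Min, Max, Avg; that layout is an assumption to check against your own log. The device name, kernel names and counts in SAMPLE are made up for illustration, and totals_per_kernel is a hypothetical helper, not part of any NVIDIA tool.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample in the assumed column layout; names and counts are made up.
SAMPLE = '''"Device","Kernel","Invocations","Metric Name","Metric Description","Min","Max","Avg"
"Example GPU (0)","example_h884cudnn_kernel",2,"flop_count_hp","FP ops (half)",100,300,200
"Example GPU (0)","example_h884cudnn_kernel",2,"flop_count_sp","FP ops (single)",0,0,0
"Example GPU (0)","example_fp32_kernel",3,"flop_count_sp","FP ops (single)",10,10,10
'''


def totals_per_kernel(csv_text):
    """Return {kernel: {metric: total}} where total = Invocations * Avg."""
    totals = defaultdict(lambda: defaultdict(int))
    for row in csv.DictReader(io.StringIO(csv_text)):
        total = round(int(row["Invocations"]) * float(row["Avg"]))
        totals[row["Kernel"]][row["Metric Name"]] += total
    return totals


totals = totals_per_kernel(SAMPLE)

# Separate the half-precision count of kernels whose name contains "h884"
# (taken here as the marker for Tensor Core kernels) from everything else.
tensor_core_hp = sum(
    metrics["flop_count_hp"] for name, metrics in totals.items() if "h884" in name
)
other_sp = sum(
    metrics["flop_count_sp"] for name, metrics in totals.items() if "h884" not in name
)

print("flop_count_hp in h884 kernels:", tensor_core_hp)
print("flop_count_sp in other kernels:", other_sp)
```

Keeping the h884 kernels in their own bucket makes it easy to try a scaling factor on just that bucket and compare the result with the FP32 total.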


FP32:
#flop_count_sp = 50559409362 flops
#flop_count_hp = 0 flops


FP16:
#flop_count_sp = 40493448 flops
#flop_count_hp = 706028298 flops
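
As a quick sanity check on the 8 × 8 × 4 idea, the arithmetic can be sketched as below. It takes the first pair of counters as the FP32 run and the second pair as the FP16 run (inferred from flop_count_hp being 0 in the first). It also assumes, purely as a hypothesis to test, that every counted half-precision flop comes from a Tensor Core kernel and would be scaled by 8 * 8 * 4 = 256.

```python
# Counters as reported above; the FP32/FP16 assignment is an inference.
fp32 = {"flop_count_sp": 50559409362, "flop_count_hp": 0}
fp16 = {"flop_count_sp": 40493448, "flop_count_hp": 706028298}

fp32_total = fp32["flop_count_sp"] + fp32["flop_count_hp"]
fp16_total = fp16["flop_count_sp"] + fp16["flop_count_hp"]

# Hypothesis from the question: scale the half-precision count by 8 * 8 * 4.
SCALE = 8 * 8 * 4
fp16_scaled = fp16["flop_count_sp"] + fp16["flop_count_hp"] * SCALE

print("FP32 total:            ", fp32_total)
print("FP16 total (raw):      ", fp16_total)
print("FP16 total (hp x 256): ", fp16_scaled)
print("FP32 / FP16 raw:        %.2f" % (fp32_total / fp16_total))
print("FP16 scaled / FP32:     %.2f" % (fp16_scaled / fp32_total))
```

With these numbers the raw FP16 total is roughly 68 times smaller than the FP32 total, while scaling all of flop_count_hp by 256 gives a total roughly 3.6 times larger than FP32. So a blanket factor of 256 does not reconcile the two runs on its own, at least under the assumption made here.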


Jetpack Version : 4.5.1
Board : NVIDIA Jetson AGX Xavier
TensorRT Version : 7.1.3
GPU Type : Volta 512 CUDA Cores, 64 Tensor Cores
Nvidia Driver Version :
CUDA Version : 10.2
CUDNN Version : 8.0



We are checking the details internally.
Will get back to you later.



Sorry for the late update.
Let's continue tracking this issue in the topic "Why the number of flops is different between FP32 and FP16 mode with YOLOv3 TensorRT implementation?"