In my last topic, I tried to calculate the number of FLOPs when executing YOLOv3 with TensorRT, using the trtexec tool for inference and nvprof to count the FLOPs per kernel. As I said in the topic I mentioned above, there was a huge difference in the measured FLOP counts between the FP32 and FP16 implementations.
The main difference between the two precision modes is the use of Tensor Cores (h884cudnn kernels) for the convolution layers. I am wondering if I have to multiply the FLOP count of each Tensor Core kernel by 8x8x4 to get the final result?
FP32 engine:
#flop_count_sp = 50559409362 flops
#flop_count_hp = 0 flops

FP16 engine:
#flop_count_sp = 40493448 flops
#flop_count_hp = 706028298 flops
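For reference, here is a minimal sketch of the arithmetic behind the 8x8x4 question. It assumes that each Volta HMMA.884 instruction performs one 8x8x4 matrix multiply-accumulate, i.e. 8*8*4 = 256 fused multiply-adds; whether a multiply-add counts as 1 or 2 FLOPs is a convention (the factor of 2 below is an assumption, not something nvprof reports), and the instruction count used here is a placeholder, not a measured value:

```python
# Sketch: estimating Tensor Core FLOPs from an HMMA.884 instruction count.
# Assumption: one HMMA.884 instruction = one 8x8x4 matrix multiply-accumulate.

HMMA_M, HMMA_N, HMMA_K = 8, 8, 4   # tile shape implied by the "884" kernel name
FLOPS_PER_FMA = 2                  # convention: count multiply and add separately

def tensor_core_flops(hmma_instruction_count: int) -> int:
    """FLOPs performed by the given number of HMMA.884 instructions."""
    return hmma_instruction_count * HMMA_M * HMMA_N * HMMA_K * FLOPS_PER_FMA

# Hypothetical example: 1,000,000 HMMA instructions
print(tensor_core_flops(1_000_000))  # 512000000
```

So if nvprof's flop_count_hp does not include Tensor Core work, multiplying the per-kernel instruction count by 8*8*4 (FMAs) rather than adjusting flop_count_hp itself would be the consistent way to combine the two.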
JetPack Version : 4.5.1
Board : NVIDIA Jetson AGX Xavier
TensorRT Version : 7.1.3
GPU Type : Volta 512 CUDA Cores, 64 Tensor Cores
Nvidia Driver Version :
CUDA Version : 10.2
CUDNN Version : 8.0