Hi all,
I ran YOLOv3 with TensorRT using the NVIDIA sample yolov3_onnx in FP32 and FP16 mode, and I used nvprof to count the floating-point operations (FLOPs) executed in each precision mode. The FLOP counts in FP32 and FP16 turned out to be very different:
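For reference, the runs were profiled with commands roughly like the following (flop_count_sp and flop_count_hp are nvprof's single- and half-precision operation counters; the FP16 run additionally collected flop_sp_efficiency and flop_hp_efficiency):

nvprof --metrics flop_count_sp,flop_count_hp python3 /home/mkartobi/yolov3_onnx/onnx_to_tensorrt.py -i dog -m yolov3 -r 416 -p FP32 -b 1
nvprof python3 /home/mkartobi/yolov3_onnx/onnx_to_tensorrt.py -i dog -m yolov3 -r 416 -p FP32 -b 1

The first command produces the per-kernel metric tables below; the second, plain run produces the kernel timing summaries. The FP16 runs are identical except for -p FP16.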
YOLOv3 TRT FP32
Number of FLOPs per kernel
==28084== Profiling result:
==28084== Metric result:
Invocations Metric Name Metric Description Min Max Avg
Device "Xavier (0)"
Kernel: trt_volta_scudnn_128x32_relu_interior_nn_v1
1 flop_count_hp Floating Point Operations(Half Precision) 0 0 0
1 flop_count_sp Floating Point Operations(Single Precision) 179978240 179978240 179978240
Kernel: trt_volta_scudnn_128x128_relu_interior_nn_v1
7 flop_count_hp Floating Point Operations(Half Precision) 0 0 0
7 flop_count_sp Floating Point Operations(Single Precision) 268697600 268697600 268697600
Kernel: generatedNativePointwise
72 flop_count_hp Floating Point Operations(Half Precision) 0 0 0
72 flop_count_sp Floating Point Operations(Single Precision) 43264 5537792 733685
Kernel: trt_volta_scudnn_winograd_128x128_ldg1_ldg4_relu_tile148t_nt_v1
32 flop_count_hp Floating Point Operations(Half Precision) 0 0 0
32 flop_count_sp Floating Point Operations(Single Precision) 765599744 1121189888 1043433216
Kernel: void cuResizeLayer::ResizeNearestGenericKernel<float, int=2, int=32>(float*, cuResizeLayer::ResizeNearestGenericKernel<float, int=2, int=32> const *, cuResizeLayer::LaunchParams)
2 flop_count_hp Floating Point Operations(Half Precision) 0 0 0
2 flop_count_sp Floating Point Operations(Single Precision) 346112 692224 519168
Kernel: trt_volta_scudnn_128x128_relu_small_nn_v1
5 flop_count_hp Floating Point Operations(Half Precision) 0 0 0
5 flop_count_sp Floating Point Operations(Single Precision) 50528256 2416443392 1232869785
Kernel: trt_volta_scudnn_128x32_relu_small_nn_v1
1 flop_count_hp Floating Point Operations(Half Precision) 0 0 0
1 flop_count_sp Floating Point Operations(Single Precision) 321191936 321191936 321191936
Kernel: trt_volta_scudnn_128x64_relu_interior_nn_v1
27 flop_count_hp Floating Point Operations(Half Precision) 0 0 0
27 flop_count_sp Floating Point Operations(Single Precision) 67239936 370540544 196497559
Kernel: trt_volta_scudnn_128x64_relu_small_nn_v1
1 flop_count_hp Floating Point Operations(Half Precision) 0 0 0
1 flop_count_sp Floating Point Operations(Single Precision) 1662386176 1662386176 1662386176
Kernel: trt_volta_scudnn_128x64_relu_xregs_large_nn_v1
1 flop_count_hp Floating Point Operations(Half Precision) 0 0 0
1 flop_count_sp Floating Point Operations(Single Precision) 1600421888 1600421888 1600421888
Kernels used in FP32
==28196== Profiling application: python3 /home/mkartobi/yolov3_onnx/onnx_to_tensorrt.py -i dog -m yolov3 -r 416 -p FP32 -b 1
==28196== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 39.45% 84.491ms 6 14.082ms 1.0240us 84.318ms [CUDA memcpy HtoD]
30.76% 65.890ms 152 433.48us 5.1520us 8.5030ms [CUDA memcpy DtoD]
17.96% 38.466ms 32 1.2021ms 848.96us 1.3948ms trt_volta_scudnn_winograd_128x128_ldg1_ldg4_relu_tile148t_nt_v1
3.48% 7.4574ms 5 1.4915ms 65.509us 2.6534ms trt_volta_scudnn_128x128_relu_small_nn_v1
2.76% 5.9128ms 27 218.99us 92.168us 290.49us trt_volta_scudnn_128x64_relu_interior_nn_v1
2.32% 4.9627ms 72 68.925us 5.1210us 552.01us generatedNativePointwise
0.92% 1.9796ms 1 1.9796ms 1.9796ms 1.9796ms trt_volta_scudnn_128x64_relu_xregs_large_nn_v1
0.92% 1.9611ms 1 1.9611ms 1.9611ms 1.9611ms trt_volta_scudnn_128x64_relu_small_nn_v1
0.91% 1.9430ms 7 277.57us 219.15us 325.66us trt_volta_scudnn_128x128_relu_interior_nn_v1
0.27% 579.66us 1 579.66us 579.66us 579.66us trt_volta_scudnn_128x32_relu_small_nn_v1
0.13% 281.65us 1 281.65us 281.65us 281.65us trt_volta_scudnn_128x32_relu_interior_nn_v1
0.06% 134.16us 2 67.077us 47.108us 87.047us void cuResizeLayer::ResizeNearestGenericKernel<float, int=2, int=32>(float*, cuResizeLayer::ResizeNearestGenericKernel<float, int=2, int=32> const *, cuResizeLayer::LaunchParams)
0.05% 100.97us 3 33.656us 6.7520us 74.566us [CUDA memcpy DtoH]
0.01% 15.137us 8 1.8920us 1.0240us 3.1040us [CUDA memset]
To get the total FLOP count, I multiplied each kernel's average FLOP count by its number of invocations and summed over all kernels, which gives about 50 GFlops in FP32.
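For anyone who wants to check the arithmetic, here is a minimal Python sketch of that calculation, with the invocation counts and average FLOP counts copied from the metric table above (sp and hp counts are simply added together; hp is zero for every kernel in the FP32 run):

# total FLOPs = sum over kernels of (invocations * average flop count per call)
fp32_kernels = [
    # (invocations, avg flop_count_sp + avg flop_count_hp)
    (1,  179978240),   # trt_volta_scudnn_128x32_relu_interior_nn_v1
    (7,  268697600),   # trt_volta_scudnn_128x128_relu_interior_nn_v1
    (72, 733685),      # generatedNativePointwise
    (32, 1043433216),  # trt_volta_scudnn_winograd_128x128_..._tile148t_nt_v1
    (2,  519168),      # cuResizeLayer::ResizeNearestGenericKernel
    (5,  1232869785),  # trt_volta_scudnn_128x128_relu_small_nn_v1
    (1,  321191936),   # trt_volta_scudnn_128x32_relu_small_nn_v1
    (27, 196497559),   # trt_volta_scudnn_128x64_relu_interior_nn_v1
    (1,  1662386176),  # trt_volta_scudnn_128x64_relu_small_nn_v1
    (1,  1600421888),  # trt_volta_scudnn_128x64_relu_xregs_large_nn_v1
]
total = sum(calls * avg for calls, avg in fp32_kernels)
print(f"FP32 total: {total / 1e9:.1f} GFlops")  # -> 50.6 GFlops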
YOLOv3 TRT FP16
Number of FLOPs per kernel
==1869== Profiling result:
==1869== Metric result:
Invocations Metric Name Metric Description Min Max Avg
Device "Xavier (0)"
Kernel: trt_volta_h884cudnn_256x64_ldg8_relu_exp_interior_nhwc_tn_v1
1 flop_count_hp Floating Point Operations(Half Precision) 5883904 5883904 5883904
1 flop_count_sp Floating Point Operations(Single Precision) 0 0 0
1 flop_sp_efficiency FLOP Efficiency(Peak Single) 0.00% 0.00% 0.00%
1 flop_hp_efficiency FLOP Efficiency(Peak Half) 1.69% 1.69% 1.69%
Kernel: generatedNativePointwise
72 flop_count_hp Floating Point Operations(Half Precision) 0 5537792 656771
72 flop_count_sp Floating Point Operations(Single Precision) 18134 5537792 587567
72 flop_sp_efficiency FLOP Efficiency(Peak Single) 0.12% 2.01% 0.74%
72 flop_hp_efficiency FLOP Efficiency(Peak Half) 0.00% 1.18% 0.46%
Kernel: trt_volta_h884cudnn_256x64_sliced1x2_ldg8_relu_exp_small_nhwc_tn_v1
2 flop_count_hp Floating Point Operations(Half Precision) 409600 2252800 1331200
2 flop_count_sp Floating Point Operations(Single Precision) 0 0 0
2 flop_sp_efficiency FLOP Efficiency(Peak Single) 0.00% 0.00% 0.00%
2 flop_hp_efficiency FLOP Efficiency(Peak Half) 0.46% 1.05% 0.75%
Kernel: trt_volta_h884cudnn_256x128_ldg8_relu_exp_medium_nhwc_tn_v1
25 flop_count_hp Floating Point Operations(Half Precision) 417792 2994176 1061191
25 flop_count_sp Floating Point Operations(Single Precision) 0 0 0
25 flop_sp_efficiency FLOP Efficiency(Peak Single) 0.00% 0.00% 0.00%
25 flop_hp_efficiency FLOP Efficiency(Peak Half) 0.07% 0.44% 0.27%
Kernel: void cuInt8::nhwcTonchw<__half, int=32, int=32, int=2>(__half const *, __half*, int, int, int, int, int, int)
2 flop_count_hp Floating Point Operations(Half Precision) 0 0 0
2 flop_count_sp Floating Point Operations(Single Precision) 0 0 0
2 flop_sp_efficiency FLOP Efficiency(Peak Single) 0.00% 0.00% 0.00%
2 flop_hp_efficiency FLOP Efficiency(Peak Half) 0.00% 0.00% 0.00%
Kernel: trt_volta_h884cudnn_256x128_ldg8_relu_exp_small_nhwc_tn_v1
7 flop_count_hp Floating Point Operations(Half Precision) 557056 557056 557056
7 flop_count_sp Floating Point Operations(Single Precision) 0 0 0
7 flop_sp_efficiency FLOP Efficiency(Peak Single) 0.00% 0.00% 0.00%
7 flop_hp_efficiency FLOP Efficiency(Peak Half) 0.07% 0.07% 0.07%
Kernel: void cuResizeLayer::ResizeNearestGenericKernel<__half, int=2, int=32>(__half*, cuResizeLayer::ResizeNearestGenericKernel<__half, int=2, int=32> const *, cuResizeLayer::LaunchParams)
2 flop_count_hp Floating Point Operations(Half Precision) 0 0 0
2 flop_count_sp Floating Point Operations(Single Precision) 346112 692224 519168
2 flop_sp_efficiency FLOP Efficiency(Peak Single) 0.51% 0.57% 0.54%
2 flop_hp_efficiency FLOP Efficiency(Peak Half) 0.00% 0.00% 0.00%
Kernel: trt_volta_h884cudnn_128x128_ldg8_relu_exp_interior_nhwc_tn_v1
10 flop_count_hp Floating Point Operations(Half Precision) 765952 765952 765952
10 flop_count_sp Floating Point Operations(Single Precision) 0 0 0
10 flop_sp_efficiency FLOP Efficiency(Peak Single) 0.00% 0.00% 0.00%
10 flop_hp_efficiency FLOP Efficiency(Peak Half) 0.46% 0.55% 0.51%
Kernel: trt_volta_scudnn_128x32_relu_small_nn_v1
1 flop_count_hp Floating Point Operations(Half Precision) 0 0 0
1 flop_count_sp Floating Point Operations(Single Precision) 321191936 321191936 321191936
1 flop_sp_efficiency FLOP Efficiency(Peak Single) 60.25% 60.25% 60.25%
1 flop_hp_efficiency FLOP Efficiency(Peak Half) 0.00% 0.00% 0.00%
Kernel: trt_volta_h884cudnn_256x64_sliced1x2_ldg8_relu_exp_interior_nhwc_tn_v1
9 flop_count_hp Floating Point Operations(Half Precision) 409600 819200 750933
9 flop_count_sp Floating Point Operations(Single Precision) 0 0 0
9 flop_sp_efficiency FLOP Efficiency(Peak Single) 0.00% 0.00% 0.00%
9 flop_hp_efficiency FLOP Efficiency(Peak Half) 0.30% 0.87% 0.59%
Kernel: trt_volta_h884cudnn_256x64_ldg8_relu_exp_medium_nhwc_tn_v1
1 flop_count_hp Floating Point Operations(Half Precision) 5883904 5883904 5883904
1 flop_count_sp Floating Point Operations(Single Precision) 0 0 0
1 flop_sp_efficiency FLOP Efficiency(Peak Single) 0.00% 0.00% 0.00%
1 flop_hp_efficiency FLOP Efficiency(Peak Half) 0.62% 0.62% 0.62%
Kernel: trt_volta_h884cudnn_256x64_ldg8_relu_exp_small_nhwc_tn_v1
3 flop_count_hp Floating Point Operations(Half Precision) 1497088 5883904 2959360
3 flop_count_sp Floating Point Operations(Single Precision) 0 0 0
3 flop_sp_efficiency FLOP Efficiency(Peak Single) 0.00% 0.00% 0.00%
3 flop_hp_efficiency FLOP Efficiency(Peak Half) 0.79% 0.82% 0.81%
Kernel: void cuInt8::nchwTonhwc<float, int=32, int=16, int=2>(float const *, __half*, int, int, int, int, int, int, int, int)
1 flop_count_hp Floating Point Operations(Half Precision) 0 0 0
1 flop_count_sp Floating Point Operations(Single Precision) 0 0 0
1 flop_sp_efficiency FLOP Efficiency(Peak Single) 0.00% 0.00% 0.00%
1 flop_hp_efficiency FLOP Efficiency(Peak Half) 0.00% 0.00% 0.00%
Kernel: void cuInt8::nhwcTonchw<float, int=32, int=32, int=2>(__half const *, float*, int, int, int, int, int, int)
3 flop_count_hp Floating Point Operations(Half Precision) 43191 690880 302267
3 flop_count_sp Floating Point Operations(Single Precision) 0 0 0
3 flop_sp_efficiency FLOP Efficiency(Peak Single) 0.00% 0.00% 0.00%
3 flop_hp_efficiency FLOP Efficiency(Peak Half) 0.15% 0.48% 0.32%
Kernel: trt_volta_h884cudnn_128x128_ldg8_relu_exp_small_nhwc_tn_v1
13 flop_count_hp Floating Point Operations(Half Precision) 835584 2959360 1162318
13 flop_count_sp Floating Point Operations(Single Precision) 0 0 0
13 flop_sp_efficiency FLOP Efficiency(Peak Single) 0.00% 0.00% 0.00%
13 flop_hp_efficiency FLOP Efficiency(Peak Half) 0.10% 0.45% 0.16%
Kernel: trt_volta_h884cudnn_128x128_ldg8_relu_exp_medium_nhwc_tn_v1
2 flop_count_hp Floating Point Operations(Half Precision) 835584 1531904 1183744
2 flop_count_sp Floating Point Operations(Single Precision) 0 0 0
2 flop_sp_efficiency FLOP Efficiency(Peak Single) 0.00% 0.00% 0.00%
2 flop_hp_efficiency FLOP Efficiency(Peak Half) 0.09% 0.64% 0.37%
Kernel: trt_volta_h884cudnn_256x128_ldg8_relu_exp_interior_nhwc_tn_v1
1 flop_count_hp Floating Point Operations(Half Precision) 417792 417792 417792
1 flop_count_sp Floating Point Operations(Single Precision) 0 0 0
1 flop_sp_efficiency FLOP Efficiency(Peak Single) 0.00% 0.00% 0.00%
1 flop_hp_efficiency FLOP Efficiency(Peak Half) 0.32% 0.32% 0.32%
Kernel: void cuInt8::nchwTonhwc<__half, int=32, int=32, int=2>(__half const *, __half*, int, int, int, int, int, int, int, int)
2 flop_count_hp Floating Point Operations(Half Precision) 0 0 0
2 flop_count_sp Floating Point Operations(Single Precision) 0 0 0
2 flop_sp_efficiency FLOP Efficiency(Peak Single) 0.00% 0.00% 0.00%
2 flop_hp_efficiency FLOP Efficiency(Peak Half) 0.00% 0.00% 0.00%
Kernels used in FP16
==1703== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 24.41% 18.679ms 150 124.53us 12.001us 653.66us [CUDA memcpy DtoD]
22.22% 17.006ms 6 2.8343ms 1.2480us 16.758ms [CUDA memcpy HtoD]
16.56% 12.674ms 25 506.96us 61.317us 2.3652ms trt_volta_h884cudnn_256x128_ldg8_relu_exp_medium_nhwc_tn_v1
9.64% 7.3746ms 72 102.42us 6.3050us 1.4953ms generatedNativePointwise
8.70% 6.6591ms 13 512.24us 360.67us 917.87us trt_volta_h884cudnn_128x128_ldg8_relu_exp_small_nhwc_tn_v1
4.32% 3.3025ms 7 471.79us 401.89us 858.22us trt_volta_h884cudnn_256x128_ldg8_relu_exp_small_nhwc_tn_v1
3.60% 2.7583ms 10 275.83us 66.534us 1.4785ms trt_volta_h884cudnn_128x128_ldg8_relu_exp_interior_nhwc_tn_v1
2.00% 1.5334ms 1 1.5334ms 1.5334ms 1.5334ms trt_volta_scudnn_128x32_relu_small_nn_v1
1.98% 1.5178ms 3 505.92us 236.60us 1.0432ms trt_volta_h884cudnn_256x64_ldg8_relu_exp_small_nhwc_tn_v1
1.84% 1.4064ms 1 1.4064ms 1.4064ms 1.4064ms trt_volta_h884cudnn_256x64_ldg8_relu_exp_medium_nhwc_tn_v1
1.49% 1.1377ms 1 1.1377ms 1.1377ms 1.1377ms void cuInt8::nchwTonhwc<float, int=32, int=16, int=2>(float const *, __half*, int, int, int, int, int, int, int, int)
0.82% 630.74us 2 315.37us 125.93us 504.81us trt_volta_h884cudnn_128x128_ldg8_relu_exp_medium_nhwc_tn_v1
0.74% 567.83us 9 63.091us 30.819us 71.206us trt_volta_h884cudnn_256x64_sliced1x2_ldg8_relu_exp_interior_nhwc_tn_v1
0.63% 482.03us 1 482.03us 482.03us 482.03us trt_volta_h884cudnn_256x64_ldg8_relu_exp_interior_nhwc_tn_v1
0.29% 218.20us 2 109.10us 74.438us 143.76us void cuResizeLayer::ResizeNearestGenericKernel<__half, int=2, int=32>(__half*, cuResizeLayer::ResizeNearestGenericKernel<__half, int=2, int=32> const *, cuResizeLayer::LaunchParams)
0.22% 167.92us 3 55.973us 10.849us 125.07us [CUDA memcpy DtoH]
0.19% 148.69us 2 74.342us 37.603us 111.08us trt_volta_h884cudnn_256x64_sliced1x2_ldg8_relu_exp_small_nhwc_tn_v1
0.13% 97.063us 3 32.354us 8.8000us 66.757us void cuInt8::nhwcTonchw<float, int=32, int=32, int=2>(__half const *, float*, int, int, int, int, int, int)
0.08% 63.590us 1 63.590us 63.590us 63.590us trt_volta_h884cudnn_256x128_ldg8_relu_exp_interior_nhwc_tn_v1
0.08% 60.325us 2 30.162us 20.097us 40.228us void cuInt8::nchwTonhwc<__half, int=32, int=32, int=2>(__half const *, __half*, int, int, int, int, int, int, int, int)
0.03% 20.226us 2 10.113us 8.7370us 11.489us void cuInt8::nhwcTonchw<__half, int=32, int=32, int=2>(__half const *, __half*, int, int, int, int, int, int)
0.02% 19.011us 8 2.3760us 1.7920us 4.0010us [CUDA memset]
Applying the same calculation to the FP16 run (average FLOPs per kernel multiplied by the number of invocations, summed over all kernels) gives about 500 MFlops.
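As a sanity check on that number: the largest FP16 contributors are the remaining FP32 kernel trt_volta_scudnn_128x32_relu_small_nn_v1 (1 × 321,191,936 ≈ 321 MFlops, single precision) and generatedNativePointwise (72 × 656,771 ≈ 47 MFlops half precision plus 72 × 587,567 ≈ 42 MFlops single precision). All of the h884cudnn Tensor Core kernels combined contribute less than 100 MFlops, which brings the total to roughly 0.5 GFlops.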
Environment
Jetpack Version: 4.5.1
Board : NVIDIA Jetson AGX Xavier
TensorRT Version: 7.1.3
GPU Type: Volta 512 CUDA Cores, 64 Tensor Cores
Nvidia Driver Version:
CUDA Version: 10.2
CUDNN Version: 8.0
FLOPs in FP32 = 50 GFlops
FLOPs in FP16 = 500 MFlops
My question is: why is there such a huge difference in the number of floating-point operations between FP32 and FP16, even though both runs execute the same network (YOLOv3)? Does the use of Tensor Cores in FP16 mode reduce the number of floating-point operations that nvprof counts?
Thanks