Why is the FLOP count different between FP32 and FP16 mode in the YOLOv3 TensorRT implementation?

Hi all,

I ran YOLOv3 with TensorRT using the NVIDIA sample yolov3_onnx in FP32 and FP16 mode, and I used nvprof to collect the number of floating-point operations (FLOPs) in each precision mode. The FLOP counts for FP32 and FP16 turned out to be different:

YOLOv3 TRT FP32

Number of FLOPs per kernel

==28084== Profiling result:
==28084== Metric result:
Invocations                               Metric Name                            Metric Description         Min         Max         Avg
Device "Xavier (0)"
    Kernel: trt_volta_scudnn_128x32_relu_interior_nn_v1
          1                             flop_count_hp     Floating Point Operations(Half Precision)           0           0           0
          1                             flop_count_sp   Floating Point Operations(Single Precision)   179978240   179978240   179978240
    Kernel: trt_volta_scudnn_128x128_relu_interior_nn_v1
          7                             flop_count_hp     Floating Point Operations(Half Precision)           0           0           0
          7                             flop_count_sp   Floating Point Operations(Single Precision)   268697600   268697600   268697600
    Kernel: generatedNativePointwise
         72                             flop_count_hp     Floating Point Operations(Half Precision)           0           0           0
         72                             flop_count_sp   Floating Point Operations(Single Precision)       43264     5537792      733685
    Kernel: trt_volta_scudnn_winograd_128x128_ldg1_ldg4_relu_tile148t_nt_v1
         32                             flop_count_hp     Floating Point Operations(Half Precision)           0           0           0
         32                             flop_count_sp   Floating Point Operations(Single Precision)   765599744  1121189888  1043433216
    Kernel: void cuResizeLayer::ResizeNearestGenericKernel<float, int=2, int=32>(float*, cuResizeLayer::ResizeNearestGenericKernel<float, int=2, int=32> const *, cuResizeLayer::LaunchParams)
          2                             flop_count_hp     Floating Point Operations(Half Precision)           0           0           0
          2                             flop_count_sp   Floating Point Operations(Single Precision)      346112      692224      519168
    Kernel: trt_volta_scudnn_128x128_relu_small_nn_v1
          5                             flop_count_hp     Floating Point Operations(Half Precision)           0           0           0
          5                             flop_count_sp   Floating Point Operations(Single Precision)    50528256  2416443392  1232869785
    Kernel: trt_volta_scudnn_128x32_relu_small_nn_v1
          1                             flop_count_hp     Floating Point Operations(Half Precision)           0           0           0
          1                             flop_count_sp   Floating Point Operations(Single Precision)   321191936   321191936   321191936
    Kernel: trt_volta_scudnn_128x64_relu_interior_nn_v1
         27                             flop_count_hp     Floating Point Operations(Half Precision)           0           0           0
         27                             flop_count_sp   Floating Point Operations(Single Precision)    67239936   370540544   196497559
    Kernel: trt_volta_scudnn_128x64_relu_small_nn_v1
          1                             flop_count_hp     Floating Point Operations(Half Precision)           0           0           0
          1                             flop_count_sp   Floating Point Operations(Single Precision)  1662386176  1662386176  1662386176
    Kernel: trt_volta_scudnn_128x64_relu_xregs_large_nn_v1
          1                             flop_count_hp     Floating Point Operations(Half Precision)           0           0           0
          1                             flop_count_sp   Floating Point Operations(Single Precision)  1600421888  1600421888  1600421888

Kernels used in FP32

==28196== Profiling application: python3 /home/mkartobi/yolov3_onnx/onnx_to_tensorrt.py -i dog -m yolov3 -r 416 -p FP32 -b 1
==28196== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   39.45%  84.491ms         6  14.082ms  1.0240us  84.318ms  [CUDA memcpy HtoD]
                   30.76%  65.890ms       152  433.48us  5.1520us  8.5030ms  [CUDA memcpy DtoD]
                   17.96%  38.466ms        32  1.2021ms  848.96us  1.3948ms  trt_volta_scudnn_winograd_128x128_ldg1_ldg4_relu_tile148t_nt_v1
                    3.48%  7.4574ms         5  1.4915ms  65.509us  2.6534ms  trt_volta_scudnn_128x128_relu_small_nn_v1
                    2.76%  5.9128ms        27  218.99us  92.168us  290.49us  trt_volta_scudnn_128x64_relu_interior_nn_v1
                    2.32%  4.9627ms        72  68.925us  5.1210us  552.01us  generatedNativePointwise
                    0.92%  1.9796ms         1  1.9796ms  1.9796ms  1.9796ms  trt_volta_scudnn_128x64_relu_xregs_large_nn_v1
                    0.92%  1.9611ms         1  1.9611ms  1.9611ms  1.9611ms  trt_volta_scudnn_128x64_relu_small_nn_v1
                    0.91%  1.9430ms         7  277.57us  219.15us  325.66us  trt_volta_scudnn_128x128_relu_interior_nn_v1
                    0.27%  579.66us         1  579.66us  579.66us  579.66us  trt_volta_scudnn_128x32_relu_small_nn_v1
                    0.13%  281.65us         1  281.65us  281.65us  281.65us  trt_volta_scudnn_128x32_relu_interior_nn_v1
                    0.06%  134.16us         2  67.077us  47.108us  87.047us  void cuResizeLayer::ResizeNearestGenericKernel<float, int=2, int=32>(float*, cuResizeLayer::ResizeNearestGenericKernel<float, int=2, int=32> const *, cuResizeLayer::LaunchParams)
                    0.05%  100.97us         3  33.656us  6.7520us  74.566us  [CUDA memcpy DtoH]
                    0.01%  15.137us         8  1.8920us  1.0240us  3.1040us  [CUDA memset]

To get the total, I multiplied each kernel's average FLOP count by its number of calls and summed over all kernels, which gives roughly 50 GFLOPs in FP32.
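
The bookkeeping behind this is just invocations × average count, summed over every kernel and over both the flop_count_sp and flop_count_hp metrics. A minimal sketch with only three kernels copied in from the FP32 table above (the remaining rows are handled the same way):

# Sum FLOPs as invocations * average per-kernel count, over both metrics.
# Only a few FP32 kernels are listed as an illustration; the values are the
# Avg column of the table above.
fp32_kernels = [
    # (invocations, avg flop_count_sp, avg flop_count_hp)
    (32, 1043433216, 0),  # trt_volta_scudnn_winograd_128x128_ldg1_ldg4_relu_tile148t_nt_v1
    (27, 196497559, 0),   # trt_volta_scudnn_128x64_relu_interior_nn_v1
    (72, 733685, 0),      # generatedNativePointwise
]

total = sum(calls * (sp + hp) for calls, sp, hp in fp32_kernels)
print("Partial FLOP count = {}".format(total))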

YOLOv3 TRT FP16

Number of FLOPs per kernel

==1869== Profiling result:
==1869== Metric result:
Invocations                               Metric Name                            Metric Description         Min         Max         Avg
Device "Xavier (0)"
    Kernel: trt_volta_h884cudnn_256x64_ldg8_relu_exp_interior_nhwc_tn_v1
          1                             flop_count_hp     Floating Point Operations(Half Precision)     5883904     5883904     5883904
          1                             flop_count_sp   Floating Point Operations(Single Precision)           0           0           0
          1                        flop_sp_efficiency                  FLOP Efficiency(Peak Single)       0.00%       0.00%       0.00%
          1                        flop_hp_efficiency                    FLOP Efficiency(Peak Half)       1.69%       1.69%       1.69%
    Kernel: generatedNativePointwise
         72                             flop_count_hp     Floating Point Operations(Half Precision)           0     5537792      656771
         72                             flop_count_sp   Floating Point Operations(Single Precision)       18134     5537792      587567
         72                        flop_sp_efficiency                  FLOP Efficiency(Peak Single)       0.12%       2.01%       0.74%
         72                        flop_hp_efficiency                    FLOP Efficiency(Peak Half)       0.00%       1.18%       0.46%
    Kernel: trt_volta_h884cudnn_256x64_sliced1x2_ldg8_relu_exp_small_nhwc_tn_v1
          2                             flop_count_hp     Floating Point Operations(Half Precision)      409600     2252800     1331200
          2                             flop_count_sp   Floating Point Operations(Single Precision)           0           0           0
          2                        flop_sp_efficiency                  FLOP Efficiency(Peak Single)       0.00%       0.00%       0.00%
          2                        flop_hp_efficiency                    FLOP Efficiency(Peak Half)       0.46%       1.05%       0.75%
    Kernel: trt_volta_h884cudnn_256x128_ldg8_relu_exp_medium_nhwc_tn_v1
         25                             flop_count_hp     Floating Point Operations(Half Precision)      417792     2994176     1061191
         25                             flop_count_sp   Floating Point Operations(Single Precision)           0           0           0
         25                        flop_sp_efficiency                  FLOP Efficiency(Peak Single)       0.00%       0.00%       0.00%
         25                        flop_hp_efficiency                    FLOP Efficiency(Peak Half)       0.07%       0.44%       0.27%
    Kernel: void cuInt8::nhwcTonchw<__half, int=32, int=32, int=2>(__half const *, __half*, int, int, int, int, int, int)
          2                             flop_count_hp     Floating Point Operations(Half Precision)           0           0           0
          2                             flop_count_sp   Floating Point Operations(Single Precision)           0           0           0
          2                        flop_sp_efficiency                  FLOP Efficiency(Peak Single)       0.00%       0.00%       0.00%
          2                        flop_hp_efficiency                    FLOP Efficiency(Peak Half)       0.00%       0.00%       0.00%
    Kernel: trt_volta_h884cudnn_256x128_ldg8_relu_exp_small_nhwc_tn_v1
          7                             flop_count_hp     Floating Point Operations(Half Precision)      557056      557056      557056
          7                             flop_count_sp   Floating Point Operations(Single Precision)           0           0           0
          7                        flop_sp_efficiency                  FLOP Efficiency(Peak Single)       0.00%       0.00%       0.00%
          7                        flop_hp_efficiency                    FLOP Efficiency(Peak Half)       0.07%       0.07%       0.07%
    Kernel: void cuResizeLayer::ResizeNearestGenericKernel<__half, int=2, int=32>(__half*, cuResizeLayer::ResizeNearestGenericKernel<__half, int=2, int=32> const *, cuResizeLayer::LaunchParams)
          2                             flop_count_hp     Floating Point Operations(Half Precision)           0           0           0
          2                             flop_count_sp   Floating Point Operations(Single Precision)      346112      692224      519168
          2                        flop_sp_efficiency                  FLOP Efficiency(Peak Single)       0.51%       0.57%       0.54%
          2                        flop_hp_efficiency                    FLOP Efficiency(Peak Half)       0.00%       0.00%       0.00%
    Kernel: trt_volta_h884cudnn_128x128_ldg8_relu_exp_interior_nhwc_tn_v1
         10                             flop_count_hp     Floating Point Operations(Half Precision)      765952      765952      765952
         10                             flop_count_sp   Floating Point Operations(Single Precision)           0           0           0
         10                        flop_sp_efficiency                  FLOP Efficiency(Peak Single)       0.00%       0.00%       0.00%
         10                        flop_hp_efficiency                    FLOP Efficiency(Peak Half)       0.46%       0.55%       0.51%
    Kernel: trt_volta_scudnn_128x32_relu_small_nn_v1
          1                             flop_count_hp     Floating Point Operations(Half Precision)           0           0           0
          1                             flop_count_sp   Floating Point Operations(Single Precision)   321191936   321191936   321191936
          1                        flop_sp_efficiency                  FLOP Efficiency(Peak Single)      60.25%      60.25%      60.25%
          1                        flop_hp_efficiency                    FLOP Efficiency(Peak Half)       0.00%       0.00%       0.00%
    Kernel: trt_volta_h884cudnn_256x64_sliced1x2_ldg8_relu_exp_interior_nhwc_tn_v1
          9                             flop_count_hp     Floating Point Operations(Half Precision)      409600      819200      750933
          9                             flop_count_sp   Floating Point Operations(Single Precision)           0           0           0
          9                        flop_sp_efficiency                  FLOP Efficiency(Peak Single)       0.00%       0.00%       0.00%
          9                        flop_hp_efficiency                    FLOP Efficiency(Peak Half)       0.30%       0.87%       0.59%
    Kernel: trt_volta_h884cudnn_256x64_ldg8_relu_exp_medium_nhwc_tn_v1
          1                             flop_count_hp     Floating Point Operations(Half Precision)     5883904     5883904     5883904
          1                             flop_count_sp   Floating Point Operations(Single Precision)           0           0           0
          1                        flop_sp_efficiency                  FLOP Efficiency(Peak Single)       0.00%       0.00%       0.00%
          1                        flop_hp_efficiency                    FLOP Efficiency(Peak Half)       0.62%       0.62%       0.62%
    Kernel: trt_volta_h884cudnn_256x64_ldg8_relu_exp_small_nhwc_tn_v1
          3                             flop_count_hp     Floating Point Operations(Half Precision)     1497088     5883904     2959360
          3                             flop_count_sp   Floating Point Operations(Single Precision)           0           0           0
          3                        flop_sp_efficiency                  FLOP Efficiency(Peak Single)       0.00%       0.00%       0.00%
          3                        flop_hp_efficiency                    FLOP Efficiency(Peak Half)       0.79%       0.82%       0.81%
    Kernel: void cuInt8::nchwTonhwc<float, int=32, int=16, int=2>(float const *, __half*, int, int, int, int, int, int, int, int)
          1                             flop_count_hp     Floating Point Operations(Half Precision)           0           0           0
          1                             flop_count_sp   Floating Point Operations(Single Precision)           0           0           0
          1                        flop_sp_efficiency                  FLOP Efficiency(Peak Single)       0.00%       0.00%       0.00%
          1                        flop_hp_efficiency                    FLOP Efficiency(Peak Half)       0.00%       0.00%       0.00%
    Kernel: void cuInt8::nhwcTonchw<float, int=32, int=32, int=2>(__half const *, float*, int, int, int, int, int, int)
          3                             flop_count_hp     Floating Point Operations(Half Precision)       43191      690880      302267
          3                             flop_count_sp   Floating Point Operations(Single Precision)           0           0           0
          3                        flop_sp_efficiency                  FLOP Efficiency(Peak Single)       0.00%       0.00%       0.00%
          3                        flop_hp_efficiency                    FLOP Efficiency(Peak Half)       0.15%       0.48%       0.32%
    Kernel: trt_volta_h884cudnn_128x128_ldg8_relu_exp_small_nhwc_tn_v1
         13                             flop_count_hp     Floating Point Operations(Half Precision)      835584     2959360     1162318
         13                             flop_count_sp   Floating Point Operations(Single Precision)           0           0           0
         13                        flop_sp_efficiency                  FLOP Efficiency(Peak Single)       0.00%       0.00%       0.00%
         13                        flop_hp_efficiency                    FLOP Efficiency(Peak Half)       0.10%       0.45%       0.16%
    Kernel: trt_volta_h884cudnn_128x128_ldg8_relu_exp_medium_nhwc_tn_v1
          2                             flop_count_hp     Floating Point Operations(Half Precision)      835584     1531904     1183744
          2                             flop_count_sp   Floating Point Operations(Single Precision)           0           0           0
          2                        flop_sp_efficiency                  FLOP Efficiency(Peak Single)       0.00%       0.00%       0.00%
          2                        flop_hp_efficiency                    FLOP Efficiency(Peak Half)       0.09%       0.64%       0.37%
    Kernel: trt_volta_h884cudnn_256x128_ldg8_relu_exp_interior_nhwc_tn_v1
          1                             flop_count_hp     Floating Point Operations(Half Precision)      417792      417792      417792
          1                             flop_count_sp   Floating Point Operations(Single Precision)           0           0           0
          1                        flop_sp_efficiency                  FLOP Efficiency(Peak Single)       0.00%       0.00%       0.00%
          1                        flop_hp_efficiency                    FLOP Efficiency(Peak Half)       0.32%       0.32%       0.32%
    Kernel: void cuInt8::nchwTonhwc<__half, int=32, int=32, int=2>(__half const *, __half*, int, int, int, int, int, int, int, int)
          2                             flop_count_hp     Floating Point Operations(Half Precision)           0           0           0
          2                             flop_count_sp   Floating Point Operations(Single Precision)           0           0           0
          2                        flop_sp_efficiency                  FLOP Efficiency(Peak Single)       0.00%       0.00%       0.00%
          2                        flop_hp_efficiency                    FLOP Efficiency(Peak Half)       0.00%       0.00%       0.00%

Kernels used in FP16

==1703== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   24.41%  18.679ms       150  124.53us  12.001us  653.66us  [CUDA memcpy DtoD]
                   22.22%  17.006ms         6  2.8343ms  1.2480us  16.758ms  [CUDA memcpy HtoD]
                   16.56%  12.674ms        25  506.96us  61.317us  2.3652ms  trt_volta_h884cudnn_256x128_ldg8_relu_exp_medium_nhwc_tn_v1
                    9.64%  7.3746ms        72  102.42us  6.3050us  1.4953ms  generatedNativePointwise
                    8.70%  6.6591ms        13  512.24us  360.67us  917.87us  trt_volta_h884cudnn_128x128_ldg8_relu_exp_small_nhwc_tn_v1
                    4.32%  3.3025ms         7  471.79us  401.89us  858.22us  trt_volta_h884cudnn_256x128_ldg8_relu_exp_small_nhwc_tn_v1
                    3.60%  2.7583ms        10  275.83us  66.534us  1.4785ms  trt_volta_h884cudnn_128x128_ldg8_relu_exp_interior_nhwc_tn_v1
                    2.00%  1.5334ms         1  1.5334ms  1.5334ms  1.5334ms  trt_volta_scudnn_128x32_relu_small_nn_v1
                    1.98%  1.5178ms         3  505.92us  236.60us  1.0432ms  trt_volta_h884cudnn_256x64_ldg8_relu_exp_small_nhwc_tn_v1
                    1.84%  1.4064ms         1  1.4064ms  1.4064ms  1.4064ms  trt_volta_h884cudnn_256x64_ldg8_relu_exp_medium_nhwc_tn_v1
                    1.49%  1.1377ms         1  1.1377ms  1.1377ms  1.1377ms  void cuInt8::nchwTonhwc<float, int=32, int=16, int=2>(float const *, __half*, int, int, int, int, int, int, int, int)
                    0.82%  630.74us         2  315.37us  125.93us  504.81us  trt_volta_h884cudnn_128x128_ldg8_relu_exp_medium_nhwc_tn_v1
                    0.74%  567.83us         9  63.091us  30.819us  71.206us  trt_volta_h884cudnn_256x64_sliced1x2_ldg8_relu_exp_interior_nhwc_tn_v1
                    0.63%  482.03us         1  482.03us  482.03us  482.03us  trt_volta_h884cudnn_256x64_ldg8_relu_exp_interior_nhwc_tn_v1
                    0.29%  218.20us         2  109.10us  74.438us  143.76us  void cuResizeLayer::ResizeNearestGenericKernel<__half, int=2, int=32>(__half*, cuResizeLayer::ResizeNearestGenericKernel<__half, int=2, int=32> const *, cuResizeLayer::LaunchParams)
                    0.22%  167.92us         3  55.973us  10.849us  125.07us  [CUDA memcpy DtoH]
                    0.19%  148.69us         2  74.342us  37.603us  111.08us  trt_volta_h884cudnn_256x64_sliced1x2_ldg8_relu_exp_small_nhwc_tn_v1
                    0.13%  97.063us         3  32.354us  8.8000us  66.757us  void cuInt8::nhwcTonchw<float, int=32, int=32, int=2>(__half const *, float*, int, int, int, int, int, int)
                    0.08%  63.590us         1  63.590us  63.590us  63.590us  trt_volta_h884cudnn_256x128_ldg8_relu_exp_interior_nhwc_tn_v1
                    0.08%  60.325us         2  30.162us  20.097us  40.228us  void cuInt8::nchwTonhwc<__half, int=32, int=32, int=2>(__half const *, __half*, int, int, int, int, int, int, int, int)
                    0.03%  20.226us         2  10.113us  8.7370us  11.489us  void cuInt8::nhwcTonchw<__half, int=32, int=32, int=2>(__half const *, __half*, int, int, int, int, int, int)
                    0.02%  19.011us         8  2.3760us  1.7920us  4.0010us  [CUDA memset]

To get the total, I again multiplied each kernel's average FLOP count by its number of calls and summed over all kernels, which gives roughly 500 MFLOPs in FP16.
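
Copying these averages by hand is error-prone, so the same total can be scripted. A rough sketch, assuming the metrics run was saved with nvprof --csv --log-file metrics.csv and that the CSV columns mirror the text table above (Invocations, Metric Name, Min, Max, Avg); the file name and column names are assumptions, so adjust them to whatever header nvprof actually writes on your setup:

import csv

# Sum Invocations * Avg over every flop_count_sp / flop_count_hp row of an
# nvprof metrics log saved as CSV, e.g.
#   nvprof --csv --log-file metrics.csv \
#          --metrics flop_count_sp,flop_count_hp \
#          python3 onnx_to_tensorrt.py -i dog -m yolov3 -r 416 -p FP16 -b 1
# The column names below are assumed to match the text table headers.
def total_flops(csv_path):
    with open(csv_path) as f:
        # Drop nvprof's "==<pid>== ..." banner lines before parsing.
        rows = [line for line in f if not line.startswith("==")]
    total = 0
    for row in csv.DictReader(rows):
        if row.get("Metric Name") in ("flop_count_sp", "flop_count_hp"):
            total += int(row["Invocations"]) * int(float(row["Avg"]))
    return total

print(total_flops("metrics.csv"))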

Environment

Jetpack Version: 4.5.1
Board: NVIDIA Jetson AGX Xavier
TensorRT Version: 7.1.3
GPU Type: Volta 512 CUDA Cores, 64 Tensor Cores
Nvidia Driver Version:
CUDA Version: 10.2
CUDNN Version: 8.0

FLOPs in FP32 = 50 GFLOPs

FLOPs in FP16 = 500 MFLOPs

My question is: why is there such a huge difference in the number of operations between FP32 and FP16 even though the same network (YOLOv3) is being run? Does the use of Tensor Cores in FP16 reduce the number of floating-point operations that get counted?

Thanks

Hi,

Could you share the exact flop_count_hp and flop_count_sp numbers with us?
Did you add the flop_count_hp and flop_count_sp values to get the final FLOP count?

Thanks.

Hi @AastaLLL,

Thank you for your feedback. Yes, I added the flop_count_hp and flop_count_sp values to get the final FLOP count.
The flop_count_hp and flop_count_sp numbers for each kernel are in my post above.

For FP32 mode, the final number of FLOPs I got is 50559409362.
For FP16 mode, the final number of FLOPs I got is 498780195.

This is how I calculated the final number of FLOPs in FP32 mode:

# Count the final number of FLOPs in FP32 mode

def calculate():
    # Average flop_count_sp / flop_count_hp of each kernel, in order
    flops = [1043433216, 519168, 196497559, 268697600, 1600421888,
             1662386176, 268697600, 321191936, 179978240, 519168]
    # Number of calls of each kernel
    coeff = [32, 5, 27, 72, 1, 1, 7, 1, 1, 2]

    res = 0
    assert len(coeff) == len(flops)
    for f, c in zip(flops, coeff):
        res += f * c
    print("Number of Floating Point Operations = {}".format(res))


calculate()

OUTPUT :

Number of Floating Point Operations = 50559409362

This is how I calculated the final number of FLOPs in FP16 mode:

# Count the final number of FLOPs in FP16 mode

def calculate():
    # Average flop_count_sp / flop_count_hp of each kernel, in order
    flops = [179978240, 268697600, 733685, 1043433216, 519168, 1232869785,
             321191936, 196497559, 1662386176, 1600421888, 519168]
    # Number of calls of each kernel
    coeff = [1, 7, 72, 32, 2, 5, 1, 27, 1, 1, 2]

    res = 0
    assert len(coeff) == len(flops)
    for f, c in zip(flops, coeff):
        res += f * c
    print("Number of Floating Point Operations = {}".format(res))


calculate()

OUTPUT :

Number of Floating Point Operations = 498780195

Thanks

Hi,

Sorry for the late update.
Could you separate the flop_count_hp and flop_count_sp values and share the data with us?

For example:

FP32

#flop_count_sp = ?
#flop_count_hp = ?

FP16

#flop_count_sp = ?
#flop_count_hp = ?

Thanks.

Hi @AastaLLL,

FP32

#flop_count_sp = 50559409362 flops
#flop_count_hp = 0 flops

FP16

#flop_count_sp = 40493448 flops
#flop_count_hp = 706028298 flops

The main difference between the two implementations is the use of Tensor Cores (the h884cudnn kernels) for the convolution layers. Should we multiply the FLOP count of each Tensor Core kernel by 884?

Thanks

Hi,

Sorry for the confusion here.

If you want to compare the FLOPS between FP32 and FP16, remember to divide by the nvprof execution time.

For example, calculate FLOPS = flop_count / time for each item, and then sum the per-kernel results to get the final FLOPS for FP32 and FP16.
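
One way to do that bookkeeping, aggregating as total FLOPs divided by total kernel time rather than summing per-kernel rates, with two kernels' Avg values copied from the FP32 tables above purely as an illustration:

# Convert operation counts into throughput: FLOPs / execution time.
# The two entries below copy the Avg FLOP count (metrics table) and the Avg
# kernel time (GPU-activities summary) of two FP32 kernels, as an example.
kernels = [
    # (calls, avg FLOPs per call, avg time per call in seconds)
    (32, 1043433216, 1.2021e-3),  # trt_volta_scudnn_winograd_128x128_..._nt_v1
    (27, 196497559, 218.99e-6),   # trt_volta_scudnn_128x64_relu_interior_nn_v1
]

total_ops = sum(calls * ops for calls, ops, _ in kernels)
total_time = sum(calls * t for calls, _, t in kernels)
print("Average throughput = {:.1f} GFLOPS".format(total_ops / (total_time * 1e9)))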

Thanks.

Hi @AastaLLL,

I am only looking for how many FLOPs TensorRT needs to execute YOLOv3 in each precision mode. I thought the FLOP count of YOLOv3 was fixed at about 66 GFLOPs and would not change whichever framework is used, but I have also read that TensorRT selects the best and fastest kernels for its optimizations, so I am wondering whether the lower FLOP count comes from the algorithms it chooses.
For example, TensorRT uses the Winograd convolution, which requires fewer operations than a classic convolution; a rough sketch of that effect is below.
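
A back-of-the-envelope illustration of that point (the layer shape is an arbitrary example, not taken from the profiles above): Winograd F(2x2,3x3) produces each 2x2 output tile with 16 multiplications instead of the 36 a direct 3x3 convolution needs, roughly a 2.25x reduction.

# Multiplication count of one 3x3 convolution layer, direct vs. Winograd
# F(2x2,3x3). The 256-channel 52x52 layer shape is only an illustrative guess.
c_in, c_out, h, w, k = 256, 256, 52, 52, 3

direct_mults = c_in * c_out * h * w * k * k   # one multiplication per filter tap
winograd_mults = direct_mults * 16 / 36       # 16 instead of 36 per 2x2 output tile

print("direct  : {:.2f}e9 multiplications".format(direct_mults / 1e9))
print("winograd: {:.2f}e9 multiplications".format(winograd_mults / 1e9))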

Is my assumption correct?

Thanks

Hi,

Half (FP16) and float (FP32) mode use different algorithms, since we have optimized each implementation separately.
The difference also shows up in the kernel names reported by nvprof.

Thanks.