How to calculate TOPS (INT8) or TFLOPS (FP16) of each layer of a CNN using TensorRT

Hi all,

I’ve used trtexec to generate a TensorRT engine (.trt) from an ONNX YOLOv3-Tiny model (yolov3-tiny.onnx). With profiling enabled, I get a report of the TensorRT YOLOv3-Tiny layers (after fusing/eliminating layers, choosing the best kernel tactics, adding reformatting layers, etc.). I now want to calculate the TOPS (INT8) or TFLOPS (FP16) of each layer, so that I can sum them up when I execute my neural network with the TensorRT runtime.

Is there an approach to calculate the TOPS of each layer, or of the whole network, while the neural network is running?

PS: I know that the AGX Xavier SoC can use both the GPU (512 CUDA cores + 64 Tensor Cores), which can deliver 22 TOPS or 22 TFLOPS, and the DLA cores, but what I am trying to do is calculate the TOPS achieved when I actually run the neural network.
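
My current idea is to compute the FLOPs of each layer analytically from its shapes and divide by the measured layer time. A rough sketch of what I mean (the formula assumes standard dense convolutions and counts a multiply-accumulate as 2 operations; the 3.13 ms figure is the 001_convolutional average from the profile below):

def conv_flops(c_in, c_out, k_h, k_w, h_out, w_out, batch=1):
    # FLOPs of a standard convolution, counting each
    # multiply-accumulate as 2 operations (bias ignored).
    return 2 * batch * c_in * c_out * k_h * k_w * h_out * w_out

def effective_tops(flops, avg_time_ms):
    # Effective throughput (TOPS for INT8, TFLOPS for FP16)
    # given the average layer latency reported by trtexec.
    return flops / (avg_time_ms * 1e-3) / 1e12

# First YOLOv3-Tiny conv: 3->16 channels, 3x3 kernel, 416x416 output, batch 16.
flops = conv_flops(3, 16, 3, 3, 416, 416, batch=16)
print(f"{flops / 1e9:.2f} GFLOPs -> {effective_tops(flops, 3.13):.3f} TOPS")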

Command line I used:

/usr/src/tensorrt/bin/trtexec --onnx=yolov3-tiny-416-bs16.onnx --best --workspace=2048 --saveEngine=yolov3-tiny-416-bs16.trt --calib=calib_yolov3-tiny-int8-416.bin --verbose --dumpProfile

Board : Jetson AGX Xavier
TensorRT version : 7.1.3
cuDNN : 8.0
CUDA : 10.2
JetPack version : 4.5.1

[06/10/2021-16:44:53] [I] === Profile (196 iterations ) ===
[06/10/2021-16:44:53] [I]                                        Layer   Time (ms)   Avg. Time (ms)   Time %
[06/10/2021-16:44:53] [I]        001_convolutional input reformatter 0       76.11             0.39      2.6
[06/10/2021-16:44:53] [I]                            001_convolutional      612.51             3.13     20.7
[06/10/2021-16:44:53] [I]                      001_convolutional_lrelu      232.50             1.19      7.9
[06/10/2021-16:44:53] [I]                                  002_maxpool      128.33             0.65      4.3
[06/10/2021-16:44:53] [I]                            003_convolutional      295.12             1.51     10.0
[06/10/2021-16:44:53] [I]                      003_convolutional_lrelu      109.19             0.56      3.7
[06/10/2021-16:44:53] [I]                                  004_maxpool       68.56             0.35      2.3
[06/10/2021-16:44:53] [I]                            005_convolutional      105.47             0.54      3.6
[06/10/2021-16:44:53] [I]                      005_convolutional_lrelu       55.56             0.28      1.9
[06/10/2021-16:44:53] [I]                                  006_maxpool       36.03             0.18      1.2
[06/10/2021-16:44:53] [I]                            007_convolutional       78.12             0.40      2.6
[06/10/2021-16:44:53] [I]                      007_convolutional_lrelu       28.68             0.15      1.0
[06/10/2021-16:44:53] [I]                                  008_maxpool       19.72             0.10      0.7
[06/10/2021-16:44:53] [I]                            009_convolutional       74.31             0.38      2.5
[06/10/2021-16:44:53] [I]                      009_convolutional_lrelu       16.70             0.09      0.6
[06/10/2021-16:44:53] [I]                                  010_maxpool       10.84             0.06      0.4
[06/10/2021-16:44:53] [I]                            011_convolutional       74.75             0.38      2.5
[06/10/2021-16:44:53] [I]                      011_convolutional_lrelu        9.46             0.05      0.3
[06/10/2021-16:44:53] [I]                                  012_maxpool       15.74             0.08      0.5
[06/10/2021-16:44:53] [I]                            013_convolutional      265.81             1.36      9.0
[06/10/2021-16:44:53] [I]                      013_convolutional_lrelu       17.18             0.09      0.6
[06/10/2021-16:44:53] [I]                            014_convolutional       20.95             0.11      0.7
[06/10/2021-16:44:53] [I]                      014_convolutional_lrelu        5.78             0.03      0.2
[06/10/2021-16:44:53] [I]                            019_convolutional        5.48             0.03      0.2
[06/10/2021-16:44:53] [I]                            015_convolutional       75.62             0.39      2.6
[06/10/2021-16:44:53] [I]  019_convolutional_lrelu input reformatter 0        3.12             0.02      0.1
[06/10/2021-16:44:53] [I]                      019_convolutional_lrelu        4.30             0.02      0.1
[06/10/2021-16:44:53] [I]                      015_convolutional_lrelu        9.37             0.05      0.3
[06/10/2021-16:44:53] [I]        016_convolutional input reformatter 0        9.28             0.05      0.3
[06/10/2021-16:44:53] [I]                            016_convolutional       26.17             0.13      0.9
[06/10/2021-16:44:53] [I]       016_convolutional output reformatter 0       11.55             0.06      0.4
[06/10/2021-16:44:53] [I]             020_upsample input reformatter 0        4.29             0.02      0.1
[06/10/2021-16:44:53] [I]                                 020_upsample       66.13             0.34      2.2
[06/10/2021-16:44:53] [I]                            020_upsample copy       62.05             0.32      2.1
[06/10/2021-16:44:53] [I]                            022_convolutional      196.06             1.00      6.6
[06/10/2021-16:44:53] [I]                      022_convolutional_lrelu       15.59             0.08      0.5
[06/10/2021-16:44:53] [I]        023_convolutional input reformatter 0       17.04             0.09      0.6
[06/10/2021-16:44:53] [I]                            023_convolutional       54.78             0.28      1.9
[06/10/2021-16:44:53] [I]       023_convolutional output reformatter 0       36.01             0.18      1.2
[06/10/2021-16:44:53] [I]                                        Total     2954.25            15.07    100.0
[06/10/2021-16:44:53] [I] 
&&&& PASSED TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --onnx=yolov3-tiny-416-bs16.onnx --best --workspace=2048 --saveEngine=yolov3-tiny-416-bs16.trt --calib=calib_yolov3-tiny-int8-416.bin --verbose --dumpProfile
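
For reference, here is roughly how I plan to pair the averages above with analytic FLOP counts (a quick sketch: the regex is tailored to this log format, which may differ between TensorRT versions, and the flops_per_layer values are placeholders I would fill in from the ONNX shapes):

import re

# Matches the "[I] <layer name> <total ms> <avg ms> <time %>" rows above.
PROFILE_LINE = re.compile(r"\[I\]\s+(.+?)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)$")

def parse_profile(log_path):
    # Extracts "layer name -> average time in ms" from a trtexec
    # --dumpProfile log like the one above.
    layers = {}
    with open(log_path) as f:
        for line in f:
            m = PROFILE_LINE.search(line.rstrip())
            if m:
                name, _total, avg_ms, _pct = m.groups()
                layers[name] = float(avg_ms)
    return layers

# Placeholder FLOP counts computed from the ONNX shapes (hypothetical values).
flops_per_layer = {"001_convolutional": 2.39e9}
for name, avg_ms in parse_profile("trtexec.log").items():
    if name in flops_per_layer:
        tops = flops_per_layer[name] / (avg_ms * 1e-3) / 1e12
        print(f"{name}: {tops:.3f} TOPS")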

Thank you !

Hi,

trtexec can only measure the elapsed time.
If you want low-level execution information, please use our profiling tool below:

https://developer.nvidia.com/nsight-compute
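
If per-layer timing (rather than hardware counters) is enough, the TensorRT API also lets you attach your own profiler to the execution context. A minimal sketch (engine deserialization and buffer setup are assumed to happen elsewhere):

import tensorrt as trt

class LayerTimer(trt.IProfiler):
    # Collects per-layer times via TensorRT's IProfiler callback; this
    # gives layer latencies like trtexec --dumpProfile, not op counts.
    def __init__(self):
        trt.IProfiler.__init__(self)
        self.times_ms = {}

    def report_layer_time(self, layer_name, ms):
        # Called by TensorRT once per layer per profiled execution.
        self.times_ms[layer_name] = self.times_ms.get(layer_name, 0.0) + ms

# context = engine.create_execution_context()
# context.profiler = LayerTimer()
# context.execute_v2(bindings)  # profiler callbacks fire during this call
# print(context.profiler.times_ms)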

Thanks.

Hi @AastaLLL,

Can I use this tool if I only have the Jetson Xavier, or is a host with an NVIDIA GPU essential to profile the performance of the Jetson GPU?

Thanks

@AastaLLL

I am getting this error in the NVIDIA Nsight Compute 2021.1.1 tool when I execute my command line from the tool on the host. Is this version of Nsight Compute compatible with Jetson platforms, or should I install Nsight Compute 2019?

The command I used:

/usr/src/tensorrt/bin/trtexec --onnx=yolov3-tiny-416-bs16.onnx --best --workspace=2048 --saveEngine=yolov3-tiny-416-bs16.trt --calib=calib_yolov3-tiny-int8-416.bin --verbose --dumpProfile

Am I missing something?

Errors:

==PROF== Disconnected from process 1388
7
==ERROR== The application returned an error code (1).

==ERROR== An error occurred while trying to profile.

==WARNING== No kernels were profiled.

==WARNING== Profiling kernels launched by child processes requires the --target-processes all option.

Launched application returned 1 (0x1).

Hi,

Nsight Compute needs to be launched in a desktop environment.
But you don’t need a dGPU if you are just profiling the Xavier.

Currently, the compatible version is 2019.05.
Please install it with the host CUDA package from the JetPack installer.

Thanks.

Hi @AastaLLL,

I used nvprof to count the number of FLOPs when I run YOLOv3 with TensorRT in two different precision modes (FP32 and FP16), and the numbers were very different:

For FP32 mode, the final number of FLOPs I got was 50559409362 (~50.6 GFLOPs)
For FP16 mode, the final number of FLOPs I got was 498780195 (~0.5 GFLOPs)

Why do I have this huge difference in the number of operations between FP32 and FP16 despite running the same algorithm (YOLOv3)? Does the use of Tensor Cores in FP16 mode decrease the number of floating-point operations that get counted?
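
For completeness, this is roughly how I collected the counts (a sketch of my setup: I run trtexec under nvprof and sum the flop_count_sp / flop_count_hp metrics over all kernels; the CSV column names below are what I see in nvprof's CSV mode):

import csv
import subprocess

# Run trtexec under nvprof and collect the flop counters (flags as in
# CUDA 10.2's nvprof; --loadEngine reuses the engine built earlier).
cmd = [
    "nvprof", "--csv", "--log-file", "flops.csv",
    "--metrics", "flop_count_sp,flop_count_hp",
    "/usr/src/tensorrt/bin/trtexec", "--loadEngine=yolov3-tiny-416-bs16.trt",
]
subprocess.run(cmd, check=True)

# Sum each metric over all kernels; nvprof reports per-invocation
# Min/Max/Avg, so total ~= Invocations * Avg. Note that, as far as I
# understand, work executed on Tensor Cores is not included in these
# CUDA-core flop counters.
totals = {"flop_count_sp": 0.0, "flop_count_hp": 0.0}
with open("flops.csv") as f:
    rows = [line for line in f if not line.startswith("==")]
for row in csv.DictReader(rows):
    metric = row.get("Metric Name")
    if metric in totals:
        totals[metric] += int(row["Invocations"]) * float(row["Avg"])
print(totals)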

Thanks

Let’s check this difference issue in the topic below:
