How to calculate TOPS (INT8) or TFLOPS (FP16) of each layer of a CNN using TensorRT

Hi all,

I’ve used trtexec to generate a TensorRT engine (.trt) from an ONNX YOLOv3-Tiny model (yolov3-tiny.onnx). With profiling enabled, I get a report of the TensorRT YOLOv3-Tiny layers (after fusing/eliminating layers, choosing the best kernel tactics, adding reformatting layers, etc.). I now want to calculate the TOPS (INT8) or TFLOPS (FP16) of each layer, so that I can sum them up when I execute my neural network with the TensorRT runtime.

Is there an approach to calculate the TOPS of each layer, or of the whole network, while the neural network is running?

PS: I know that the AGX Xavier SoC can use both the GPU (512 CUDA cores + 64 Tensor Cores), which can deliver 22 TOPS or 22 TFLOPS, and the DLA cores, but what I am trying to do is calculate the TOPS achieved when I actually run the neural network.
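
My current idea is to compute the FLOPs of each layer analytically from its shapes and divide by the measured layer time. A rough sketch of what I mean (the formula assumes standard dense convolutions and counts a multiply-accumulate as 2 operations; the 3.13 ms figure is the 001_convolutional average from the profile below):

def conv_flops(c_in, c_out, k_h, k_w, h_out, w_out, batch=1):
    # FLOPs of a standard convolution, counting each
    # multiply-accumulate as 2 operations (bias ignored).
    return 2 * batch * c_in * c_out * k_h * k_w * h_out * w_out

def effective_tops(flops, avg_time_ms):
    # Effective throughput (TOPS for INT8, TFLOPS for FP16)
    # given the average layer latency reported by trtexec.
    return flops / (avg_time_ms * 1e-3) / 1e12

# First YOLOv3-Tiny conv: 3->16 channels, 3x3 kernel, 416x416 output, batch 16.
flops = conv_flops(3, 16, 3, 3, 416, 416, batch=16)
print(f"{flops / 1e9:.2f} GFLOPs -> {effective_tops(flops, 3.13):.3f} TOPS")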

Command line I used:

/usr/src/tensorrt/bin/trtexec --onnx=yolov3-tiny-416-bs16.onnx --best --workspace=2048 --saveEngine=yolov3-tiny-416-bs16.trt --calib=calib_yolov3-tiny-int8-416.bin --verbose --dumpProfile

Board : Jetson AGX Xavier
TensorRT version : 7.1.3
cuDNN : 8.0
CUDA : 10.2
JetPack version : 4.5.1

[06/10/2021-16:44:53] [I] === Profile (196 iterations ) ===
[06/10/2021-16:44:53] [I]                                        Layer   Time (ms)   Avg. Time (ms)   Time %
[06/10/2021-16:44:53] [I]        001_convolutional input reformatter 0       76.11             0.39      2.6
[06/10/2021-16:44:53] [I]                            001_convolutional      612.51             3.13     20.7
[06/10/2021-16:44:53] [I]                      001_convolutional_lrelu      232.50             1.19      7.9
[06/10/2021-16:44:53] [I]                                  002_maxpool      128.33             0.65      4.3
[06/10/2021-16:44:53] [I]                            003_convolutional      295.12             1.51     10.0
[06/10/2021-16:44:53] [I]                      003_convolutional_lrelu      109.19             0.56      3.7
[06/10/2021-16:44:53] [I]                                  004_maxpool       68.56             0.35      2.3
[06/10/2021-16:44:53] [I]                            005_convolutional      105.47             0.54      3.6
[06/10/2021-16:44:53] [I]                      005_convolutional_lrelu       55.56             0.28      1.9
[06/10/2021-16:44:53] [I]                                  006_maxpool       36.03             0.18      1.2
[06/10/2021-16:44:53] [I]                            007_convolutional       78.12             0.40      2.6
[06/10/2021-16:44:53] [I]                      007_convolutional_lrelu       28.68             0.15      1.0
[06/10/2021-16:44:53] [I]                                  008_maxpool       19.72             0.10      0.7
[06/10/2021-16:44:53] [I]                            009_convolutional       74.31             0.38      2.5
[06/10/2021-16:44:53] [I]                      009_convolutional_lrelu       16.70             0.09      0.6
[06/10/2021-16:44:53] [I]                                  010_maxpool       10.84             0.06      0.4
[06/10/2021-16:44:53] [I]                            011_convolutional       74.75             0.38      2.5
[06/10/2021-16:44:53] [I]                      011_convolutional_lrelu        9.46             0.05      0.3
[06/10/2021-16:44:53] [I]                                  012_maxpool       15.74             0.08      0.5
[06/10/2021-16:44:53] [I]                            013_convolutional      265.81             1.36      9.0
[06/10/2021-16:44:53] [I]                      013_convolutional_lrelu       17.18             0.09      0.6
[06/10/2021-16:44:53] [I]                            014_convolutional       20.95             0.11      0.7
[06/10/2021-16:44:53] [I]                      014_convolutional_lrelu        5.78             0.03      0.2
[06/10/2021-16:44:53] [I]                            019_convolutional        5.48             0.03      0.2
[06/10/2021-16:44:53] [I]                            015_convolutional       75.62             0.39      2.6
[06/10/2021-16:44:53] [I]  019_convolutional_lrelu input reformatter 0        3.12             0.02      0.1
[06/10/2021-16:44:53] [I]                      019_convolutional_lrelu        4.30             0.02      0.1
[06/10/2021-16:44:53] [I]                      015_convolutional_lrelu        9.37             0.05      0.3
[06/10/2021-16:44:53] [I]        016_convolutional input reformatter 0        9.28             0.05      0.3
[06/10/2021-16:44:53] [I]                            016_convolutional       26.17             0.13      0.9
[06/10/2021-16:44:53] [I]       016_convolutional output reformatter 0       11.55             0.06      0.4
[06/10/2021-16:44:53] [I]             020_upsample input reformatter 0        4.29             0.02      0.1
[06/10/2021-16:44:53] [I]                                 020_upsample       66.13             0.34      2.2
[06/10/2021-16:44:53] [I]                            020_upsample copy       62.05             0.32      2.1
[06/10/2021-16:44:53] [I]                            022_convolutional      196.06             1.00      6.6
[06/10/2021-16:44:53] [I]                      022_convolutional_lrelu       15.59             0.08      0.5
[06/10/2021-16:44:53] [I]        023_convolutional input reformatter 0       17.04             0.09      0.6
[06/10/2021-16:44:53] [I]                            023_convolutional       54.78             0.28      1.9
[06/10/2021-16:44:53] [I]       023_convolutional output reformatter 0       36.01             0.18      1.2
[06/10/2021-16:44:53] [I]                                        Total     2954.25            15.07    100.0
[06/10/2021-16:44:53] [I] 
&&&& PASSED TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --onnx=yolov3-tiny-416-bs16.onnx --best --workspace=2048 --saveEngine=yolov3-tiny-416-bs16.trt --calib=calib_yolov3-tiny-int8-416.bin --verbose --dumpProfile
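
For reference, here is roughly how I plan to pair the averages above with analytic FLOP counts (a quick sketch: the regex is tailored to this log format, which may differ between TensorRT versions, and the flops_per_layer values are placeholders I would fill in from the ONNX shapes):

import re

# Matches the "[I] <layer name> <total ms> <avg ms> <time %>" rows above.
PROFILE_LINE = re.compile(r"\[I\]\s+(.+?)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)$")

def parse_profile(log_path):
    # Extracts "layer name -> average time in ms" from a trtexec
    # --dumpProfile log like the one above.
    layers = {}
    with open(log_path) as f:
        for line in f:
            m = PROFILE_LINE.search(line.rstrip())
            if m:
                name, _total, avg_ms, _pct = m.groups()
                layers[name] = float(avg_ms)
    return layers

# Placeholder FLOP counts computed from the ONNX shapes (hypothetical values).
flops_per_layer = {"001_convolutional": 2.39e9}
for name, avg_ms in parse_profile("trtexec.log").items():
    if name in flops_per_layer:
        tops = flops_per_layer[name] / (avg_ms * 1e-3) / 1e12
        print(f"{name}: {tops:.3f} TOPS")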

Thank you !

Hi,

trtexec can only measure the elapsed time.
If you want low-level execution information, please use our profiling tool below:

https://developer.nvidia.com/nsight-compute
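
If per-layer timing (rather than hardware counters) is enough, the TensorRT API also lets you attach your own profiler to the execution context. A minimal sketch (engine deserialization and buffer setup are assumed to happen elsewhere):

import tensorrt as trt

class LayerTimer(trt.IProfiler):
    # Collects per-layer times via TensorRT's IProfiler callback; this
    # gives layer latencies like trtexec --dumpProfile, not op counts.
    def __init__(self):
        trt.IProfiler.__init__(self)
        self.times_ms = {}

    def report_layer_time(self, layer_name, ms):
        # Called by TensorRT once per layer per profiled execution.
        self.times_ms[layer_name] = self.times_ms.get(layer_name, 0.0) + ms

# context = engine.create_execution_context()
# context.profiler = LayerTimer()
# context.execute_v2(bindings)  # profiler callbacks fire during this call
# print(context.profiler.times_ms)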

Thanks.

Hi @AastaLLL,

Can I use this tool if I only have the Jetson Xavier, or is a host with an NVIDIA GPU essential to profile the performance of the Jetson GPU?

Thanks

@AastaLLL

I am getting this error in the NVIDIA Nsight Compute 2021.1.1 tool when I execute my command line from the tool on the host. Is this version of Nsight Compute compatible with Jetson platforms, or should I install Nsight Compute 2019?

The command I used:

/usr/src/tensorrt/bin/trtexec --onnx=yolov3-tiny-416-bs16.onnx --best --workspace=2048 --saveEngine=yolov3-tiny-416-bs16.trt --calib=calib_yolov3-tiny-int8-416.bin --verbose --dumpProfile

Am I missing something?

Errors:

==PROF== Disconnected from process 1388
7
==ERROR== The application returned an error code (1).

==ERROR== An error occurred while trying to profile.

==WARNING== No kernels were profiled.

==WARNING== Profiling kernels launched by child processes requires the --target-processes all option.

Launched application returned 1 (0x1).

Hi,

Nsight Compute needs to be launched in a desktop environment.
But you don’t need a dGPU if you are just profiling the Xavier.

Currently, the compatible version is 2019.05.
Please install it with the host CUDA package from the JetPack installer.

Thanks.

Hi @AastaLLL,

I used nvprof to count the number of FLOPs when I run YOLOv3 with TensorRT in two different precision modes (FP32 and FP16), and the numbers were very different:

For FP32 mode, the final number of FLOPs I got was 50559409362 (~50.6 GFLOPs)
For FP16 mode, the final number of FLOPs I got was 498780195 (~0.5 GFLOPs)

Why do I have this huge difference in the number of operations between FP32 and FP16 despite running the same algorithm (YOLOv3)? Does the use of Tensor Cores in FP16 mode decrease the number of floating-point operations that get counted?
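
For completeness, this is roughly how I collected the counts (a sketch of my setup: I run trtexec under nvprof and sum the flop_count_sp / flop_count_hp metrics over all kernels; the CSV column names below are what I see in nvprof's CSV mode):

import csv
import subprocess

# Run trtexec under nvprof and collect the flop counters (flags as in
# CUDA 10.2's nvprof; --loadEngine reuses the engine built earlier).
cmd = [
    "nvprof", "--csv", "--log-file", "flops.csv",
    "--metrics", "flop_count_sp,flop_count_hp",
    "/usr/src/tensorrt/bin/trtexec", "--loadEngine=yolov3-tiny-416-bs16.trt",
]
subprocess.run(cmd, check=True)

# Sum each metric over all kernels; nvprof reports per-invocation
# Min/Max/Avg, so total ~= Invocations * Avg. Note that, as far as I
# understand, work executed on Tensor Cores is not included in these
# CUDA-core flop counters.
totals = {"flop_count_sp": 0.0, "flop_count_hp": 0.0}
with open("flops.csv") as f:
    rows = [line for line in f if not line.startswith("==")]
for row in csv.DictReader(rows):
    metric = row.get("Metric Name")
    if metric in totals:
        totals[metric] += int(row["Invocations"]) * float(row["Avg"])
print(totals)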

Thanks

Let’s check this difference issue in the topic below:
