TRT model inference timing difference between trtexec and nsys

Module: NVIDIA Jetson AGX Xavier (32 GB RAM)
CUDA: 11.4.239
cuDNN: 8.4.1.50
TensorRT: 8.4.1.5
JetPack: 5.0.2

I am trying to use trtexec to convert a YOLOv8 ONNX model into a TensorRT engine and run it on the DLA.
The experiment follows https://forums.developer.nvidia.com/t/dla-performance/302939

I downloaded the ONNX model “yolov8s_1400_512_bs1.onnx” from the URL above.

In the terminal:
root@miivii-tegra:/home/nvidia/workspace/v8/nvidia_example# trtexec --onnx=yolov8s_1400_512_bs1.onnx --int8 --fp16 --best --useDLACore=1 --allowGPUFallback --saveEngine=yolov8s_dla_b1_int8.engine --verbose > test.log
test.log (1.8 MB)
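
To double-check how TensorRT partitioned the network between the DLA and the GPU, the verbose build log can be searched for the layer-assignment summary (a quick sketch; the exact wording of the message may differ across TensorRT versions):

$ grep -A 20 "Layers Running on" test.log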

root@miivii-tegra:/home/nvidia/workspace/v8/nvidia_example# trtexec --loadEngine=yolov8s_dla_b1_int8.engine --useDLACore=1 --dumpProfile

&&&& RUNNING TensorRT.trtexec [TensorRT v8401] # trtexec --loadEngine=yolov8s_dla_b1_int8.engine --useDLACore=1 --dumpProfile
[10/30/2024-13:47:22] [I] === Model Options ===
[10/30/2024-13:47:22] [I] Format: *
[10/30/2024-13:47:22] [I] Model:
[10/30/2024-13:47:22] [I] Output:
[10/30/2024-13:47:22] [I] === Build Options ===
[10/30/2024-13:47:22] [I] Max batch: 1
[10/30/2024-13:47:22] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[10/30/2024-13:47:22] [I] minTiming: 1
[10/30/2024-13:47:22] [I] avgTiming: 8
[10/30/2024-13:47:22] [I] Precision: FP32
[10/30/2024-13:47:22] [I] LayerPrecisions:
[10/30/2024-13:47:22] [I] Calibration:
[10/30/2024-13:47:22] [I] Refit: Disabled
[10/30/2024-13:47:22] [I] Sparsity: Disabled
[10/30/2024-13:47:22] [I] Safe mode: Disabled
[10/30/2024-13:47:22] [I] DirectIO mode: Disabled
[10/30/2024-13:47:22] [I] Restricted mode: Disabled
[10/30/2024-13:47:22] [I] Build only: Disabled
[10/30/2024-13:47:22] [I] Save engine:
[10/30/2024-13:47:22] [I] Load engine: yolov8s_dla_b1_int8.engine
[10/30/2024-13:47:22] [I] Profiling verbosity: 0
[10/30/2024-13:47:22] [I] Tactic sources: Using default tactic sources
[10/30/2024-13:47:22] [I] timingCacheMode: local
[10/30/2024-13:47:22] [I] timingCacheFile:
[10/30/2024-13:47:22] [I] Input(s)s format: fp32:CHW
[10/30/2024-13:47:22] [I] Output(s)s format: fp32:CHW
[10/30/2024-13:47:22] [I] Input build shapes: model
[10/30/2024-13:47:22] [I] Input calibration shapes: model
[10/30/2024-13:47:22] [I] === System Options ===
[10/30/2024-13:47:22] [I] Device: 0
[10/30/2024-13:47:22] [I] DLACore: 1
[10/30/2024-13:47:22] [I] Plugins:
[10/30/2024-13:47:22] [I] === Inference Options ===
[10/30/2024-13:47:22] [I] Batch: 1
[10/30/2024-13:47:22] [I] Input inference shapes: model
[10/30/2024-13:47:22] [I] Iterations: 10
[10/30/2024-13:47:22] [I] Duration: 3s (+ 200ms warm up)
[10/30/2024-13:47:22] [I] Sleep time: 0ms
[10/30/2024-13:47:22] [I] Idle time: 0ms
[10/30/2024-13:47:22] [I] Streams: 1
[10/30/2024-13:47:22] [I] ExposeDMA: Disabled
[10/30/2024-13:47:22] [I] Data transfers: Enabled
[10/30/2024-13:47:22] [I] Spin-wait: Disabled
[10/30/2024-13:47:22] [I] Multithreading: Disabled
[10/30/2024-13:47:22] [I] CUDA Graph: Disabled
[10/30/2024-13:47:22] [I] Separate profiling: Disabled
[10/30/2024-13:47:22] [I] Time Deserialize: Disabled
[10/30/2024-13:47:22] [I] Time Refit: Disabled
[10/30/2024-13:47:22] [I] Inputs:
[10/30/2024-13:47:22] [I] === Reporting Options ===
[10/30/2024-13:47:22] [I] Verbose: Disabled
[10/30/2024-13:47:22] [I] Averages: 10 inferences
[10/30/2024-13:47:22] [I] Percentile: 99
[10/30/2024-13:47:22] [I] Dump refittable layers:Disabled
[10/30/2024-13:47:22] [I] Dump output: Disabled
[10/30/2024-13:47:22] [I] Profile: Enabled
[10/30/2024-13:47:22] [I] Export timing to JSON file:
[10/30/2024-13:47:22] [I] Export output to JSON file:
[10/30/2024-13:47:22] [I] Export profile to JSON file:
[10/30/2024-13:47:22] [I]
[10/30/2024-13:47:22] [I] === Device Information ===
[10/30/2024-13:47:22] [I] Selected Device: Xavier
[10/30/2024-13:47:22] [I] Compute Capability: 7.2
[10/30/2024-13:47:22] [I] SMs: 8
[10/30/2024-13:47:22] [I] Compute Clock Rate: 1.377 GHz
[10/30/2024-13:47:22] [I] Device Global Memory: 31009 MiB
[10/30/2024-13:47:22] [I] Shared Memory per SM: 96 KiB
[10/30/2024-13:47:22] [I] Memory Bus Width: 256 bits (ECC disabled)
[10/30/2024-13:47:22] [I] Memory Clock Rate: 1.377 GHz
[10/30/2024-13:47:22] [I]
[10/30/2024-13:47:22] [I] TensorRT version: 8.4.1
[10/30/2024-13:47:22] [I] Engine loaded in 0.0142044 sec.
[10/30/2024-13:47:23] [I] [TRT] [MemUsageChange] Init CUDA: CPU +185, GPU +0, now: CPU 221, GPU 8284 (MiB)
[10/30/2024-13:47:23] [I] [TRT] Loaded engine size: 12 MiB
[10/30/2024-13:47:23] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +11, GPU +0, now: CPU 11, GPU 0 (MiB)
[10/30/2024-13:47:23] [I] Engine deserialized in 0.979467 sec.
[10/30/2024-13:47:23] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +14, now: CPU 11, GPU 14 (MiB)
[10/30/2024-13:47:23] [I] Using random values for input images
[10/30/2024-13:47:23] [I] Created input binding for images with dimensions 1x3x1408x512
[10/30/2024-13:47:23] [I] Using random values for output output0
[10/30/2024-13:47:23] [I] Created output binding for output0 with dimensions 1x84x14784
[10/30/2024-13:47:23] [I] Starting inference
[10/30/2024-13:47:27] [I] The e2e network timing is not reported since it is inaccurate due to the extra synchronizations when the profiler is enabled.
[10/30/2024-13:47:27] [I] To show e2e network timing report, add --separateProfileRun to profile layer timing in a separate run or remove --dumpProfile to disable the profiler.
[10/30/2024-13:47:27] [I]
[10/30/2024-13:47:27] [I] === Profile (46 iterations ) ===
[10/30/2024-13:47:27] [I] Layer Time (ms) Avg. Time (ms) Median Time (ms) Time %
[10/30/2024-13:47:27] [I] images to nvm 9.16 0.1991 0.1988 0.3
[10/30/2024-13:47:27] [I] {ForeignNode[/model.0/conv/Conv…/model.22/Concat_2]} 11.00 0.2392 0.1850 0.3
[10/30/2024-13:47:27] [I] Reformatting CopyNode for Output Tensor 2 to {ForeignNode[/model.0/conv/Conv…/model.22/Concat_2]} 3295.70 71.6457 71.5757 98.1
[10/30/2024-13:47:27] [I] images copy finish 0.11 0.0024 0.0023 0.0
[10/30/2024-13:47:27] [I] Reformatted Output Tensor 2 to {ForeignNode[/model.0/conv/Conv…/model.22/Concat_2]} finish 0.11 0.0023 0.0023 0.0
[10/30/2024-13:47:27] [I] /model.22/Reshape 8.14 0.1769 0.1763 0.2
[10/30/2024-13:47:27] [I] /model.22/Concat_output_0 finish 0.17 0.0037 0.0036 0.0
[10/30/2024-13:47:27] [I] /model.22/Reshape_copy_output 3.88 0.0844 0.0842 0.1
[10/30/2024-13:47:27] [I] Reformatting CopyNode for Input Tensor 0 to /model.22/Reshape_1 1.88 0.0408 0.0407 0.1
[10/30/2024-13:47:27] [I] /model.22/Concat_1_output_0 finish 0.18 0.0038 0.0038 0.0
[10/30/2024-13:47:27] [I] /model.22/Reshape_1_copy_output 1.29 0.0280 0.0279 0.0
[10/30/2024-13:47:27] [I] /model.22/Reshape_2_copy_output 0.73 0.0159 0.0156 0.0
[10/30/2024-13:47:27] [I] /model.22/dfl/Reshape + /model.22/dfl/Transpose 4.52 0.0982 0.0979 0.1
[10/30/2024-13:47:27] [I] Reformatting CopyNode for Input Tensor 0 to /model.22/dfl/Softmax 3.34 0.0725 0.0722 0.1
[10/30/2024-13:47:27] [I] /model.22/dfl/Softmax 2.34 0.0510 0.0510 0.1
[10/30/2024-13:47:27] [I] Reformatting CopyNode for Input Tensor 0 to /model.22/dfl/conv/Conv 3.27 0.0710 0.0709 0.1
[10/30/2024-13:47:27] [I] /model.22/dfl/conv/Conv 2.75 0.0599 0.0599 0.1
[10/30/2024-13:47:27] [I] /model.22/dfl/Reshape_1 0.46 0.0099 0.0098 0.0
[10/30/2024-13:47:27] [I] Reformatting CopyNode for Input Tensor 1 to scale_eltwise_of_/model.22/Sub 0.43 0.0093 0.0092 0.0
[10/30/2024-13:47:27] [I] scale_eltwise_of_/model.22/Sub 0.56 0.0122 0.0121 0.0
[10/30/2024-13:47:27] [I] Reformatting CopyNode for Input Tensor 0 to /model.22/Sub 0.36 0.0078 0.0078 0.0
[10/30/2024-13:47:27] [I] /model.22/Sub 0.59 0.0128 0.0127 0.0
[10/30/2024-13:47:27] [I] Reformatting CopyNode for Input Tensor 0 to /model.22/Add_1 0.28 0.0062 0.0061 0.0
[10/30/2024-13:47:27] [I] /model.22/Add_1 0.45 0.0099 0.0097 0.0
[10/30/2024-13:47:27] [I] /model.22/Add_2 0.40 0.0086 0.0086 0.0
[10/30/2024-13:47:27] [I] Reformatting CopyNode for Input Tensor 1 to scale_eltwise_of_/model.22/Sub_1 0.34 0.0075 0.0072 0.0
[10/30/2024-13:47:27] [I] scale_eltwise_of_/model.22/Sub_1 0.41 0.0089 0.0087 0.0
[10/30/2024-13:47:27] [I] /model.22/Sub_1 0.41 0.0089 0.0087 0.0
[10/30/2024-13:47:27] [I] Reformatting CopyNode for Output Tensor 0 to /model.22/Sub_1 0.41 0.0089 0.0089 0.0
[10/30/2024-13:47:27] [I] Reformatting CopyNode for Input Tensor 0 to /model.22/Div_1 0.33 0.0072 0.0071 0.0
[10/30/2024-13:47:27] [I] /model.22/Div_1 0.35 0.0076 0.0076 0.0
[10/30/2024-13:47:27] [I] /model.22/Div_1_output_0 copy 0.30 0.0066 0.0065 0.0
[10/30/2024-13:47:27] [I] /model.22/Mul_2 0.57 0.0124 0.0122 0.0
[10/30/2024-13:47:27] [I] PWN(/model.22/Sigmoid) 4.18 0.0909 0.0903 0.1
[10/30/2024-13:47:27] [I] Total 3359.38 73.0301 72.9173 100.0
[10/30/2024-13:47:27] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8401] # trtexec --loadEngine=yolov8s_dla_b1_int8.engine --useDLACore=1 --dumpProfile

I wonder why there is a layer “Reformatting CopyNode for Output Tensor 2 to {ForeignNode[/model.0/conv/Conv.../model.22/Concat_2]}” that accounts for almost all of the time (98.1%).
I have tried several YOLO models and see the same behavior in each case. Is this a memory problem, or a limitation of the device I am using?

I am sincerely looking forward to your advice, and I will provide more information as soon as possible if needed.

By the way, I used nsys to profile the engine:

root@miivii-tegra:/home/nvidia/workspace/v8/nvidia_example# nsys profile --trace=cuda,nvtx,cublas,cudla,cusparse,cudnn,nvmedia --output=dla /usr/src/tensorrt/bin/trtexec --loadEngine=yolov8s_dla_b1_int8.engine --iterations=10 --idleTime=500 --duration=0 --useSpinWait
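
For a quick text summary of the capture without opening the GUI, the stats subcommand should also work on the report file (the available report types vary by nsys version):

$ nsys stats dla.nsys-rep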

I loaded the resulting .nsys-rep file into Nsight Systems.

[Screenshot: Nsight Systems timeline of the trtexec run]
Looking at the last iteration, the whole inference takes about 4.496 ms.

The specific layer “Reformatting CopyNode for Output Tensor 2 to {ForeignNode[/model.0/conv/Conv.../model.22/Concat_2]}” takes only about 0.219 ms in the screenshot.

Why is there such a difference between the trtexec and nsys inference results?

Please see What does “Reformatting CopyNode for Input Tensor” mean in trtexec's dump profile · Issue #2136 · NVIDIA/TensorRT · GitHub if it helps to clarify. These nodes perform a data-format change for tensors that are fed into the DLA.
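
As an experiment, reformat nodes at the network bindings can sometimes be reduced by constraining the I/O formats at build time, for example (a sketch only; note that the large copy in your profile sits at the internal DLA-to-GPU boundary, which these flags do not control):

$ trtexec --onnx=yolov8s_1400_512_bs1.onnx --best --useDLACore=1 --allowGPUFallback --inputIOFormats=int8:dla_hwc4 --outputIOFormats=fp16:chw16 --saveEngine=yolov8s_dla_iofmt.engine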

The nsys profiler is expected to add some overhead, since it inserts additional hooks for profiling/tracing, so its timings cannot be directly compared with trtexec timings.
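
Also note that trtexec itself flags this in the log above: with --dumpProfile enabled, the extra synchronizations make the end-to-end timing inaccurate. To get an e2e number comparable to the Nsight timeline, time the layers in a separate run, e.g.:

$ trtexec --loadEngine=yolov8s_dla_b1_int8.engine --useDLACore=1 --dumpProfile --separateProfileRun --exportProfile=dla_profile.json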

Thanks for your answer. I will look into the difference.
