TRT model inference timing difference between trtexec and nsys

Module: NVIDIA Jetson AGX Xavier (32 GB RAM)
CUDA: 11.4.239
cuDNN: 8.4.1.50
TensorRT: 8.4.1.5
JetPack: 5.0.2

I am trying to use trtexec to convert a YOLOv8 ONNX model into a TensorRT engine and run it on the DLA.
The experiment follows https://forums.developer.nvidia.com/t/dla-performance/302939

I downloaded the ONNX model “yolov8s_1400_512_bs1.onnx” from the URL above.

In the terminal:
root@miivii-tegra:/home/nvidia/workspace/v8/nvidia_example# trtexec --onnx=yolov8s_1400_512_bs1.onnx --int8 --fp16 --best --useDLACore=1 --allowGPUFallback --saveEngine=yolov8s_dla_b1_int8.engine --verbose > test.log
test.log (1.8 MB)
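
To double-check how TensorRT partitioned the network between the DLA and the GPU, the verbose build log can be searched for the layer-assignment summary (a quick sketch; the exact wording of the message may differ across TensorRT versions):

$ grep -A 20 "Layers Running on" test.log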

root@miivii-tegra:/home/nvidia/workspace/v8/nvidia_example# trtexec --loadEngine=yolov8s_dla_b1_int8.engine --useDLACore=1 --dumpProfile

&&&& RUNNING TensorRT.trtexec [TensorRT v8401] # trtexec --loadEngine=yolov8s_dla_b1_int8.engine --useDLACore=1 --dumpProfile
[10/30/2024-13:47:22] [I] === Model Options ===
[10/30/2024-13:47:22] [I] Format: *
[10/30/2024-13:47:22] [I] Model:
[10/30/2024-13:47:22] [I] Output:
[10/30/2024-13:47:22] [I] === Build Options ===
[10/30/2024-13:47:22] [I] Max batch: 1
[10/30/2024-13:47:22] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[10/30/2024-13:47:22] [I] minTiming: 1
[10/30/2024-13:47:22] [I] avgTiming: 8
[10/30/2024-13:47:22] [I] Precision: FP32
[10/30/2024-13:47:22] [I] LayerPrecisions:
[10/30/2024-13:47:22] [I] Calibration:
[10/30/2024-13:47:22] [I] Refit: Disabled
[10/30/2024-13:47:22] [I] Sparsity: Disabled
[10/30/2024-13:47:22] [I] Safe mode: Disabled
[10/30/2024-13:47:22] [I] DirectIO mode: Disabled
[10/30/2024-13:47:22] [I] Restricted mode: Disabled
[10/30/2024-13:47:22] [I] Build only: Disabled
[10/30/2024-13:47:22] [I] Save engine:
[10/30/2024-13:47:22] [I] Load engine: yolov8s_dla_b1_int8.engine
[10/30/2024-13:47:22] [I] Profiling verbosity: 0
[10/30/2024-13:47:22] [I] Tactic sources: Using default tactic sources
[10/30/2024-13:47:22] [I] timingCacheMode: local
[10/30/2024-13:47:22] [I] timingCacheFile:
[10/30/2024-13:47:22] [I] Input(s)s format: fp32:CHW
[10/30/2024-13:47:22] [I] Output(s)s format: fp32:CHW
[10/30/2024-13:47:22] [I] Input build shapes: model
[10/30/2024-13:47:22] [I] Input calibration shapes: model
[10/30/2024-13:47:22] [I] === System Options ===
[10/30/2024-13:47:22] [I] Device: 0
[10/30/2024-13:47:22] [I] DLACore: 1
[10/30/2024-13:47:22] [I] Plugins:
[10/30/2024-13:47:22] [I] === Inference Options ===
[10/30/2024-13:47:22] [I] Batch: 1
[10/30/2024-13:47:22] [I] Input inference shapes: model
[10/30/2024-13:47:22] [I] Iterations: 10
[10/30/2024-13:47:22] [I] Duration: 3s (+ 200ms warm up)
[10/30/2024-13:47:22] [I] Sleep time: 0ms
[10/30/2024-13:47:22] [I] Idle time: 0ms
[10/30/2024-13:47:22] [I] Streams: 1
[10/30/2024-13:47:22] [I] ExposeDMA: Disabled
[10/30/2024-13:47:22] [I] Data transfers: Enabled
[10/30/2024-13:47:22] [I] Spin-wait: Disabled
[10/30/2024-13:47:22] [I] Multithreading: Disabled
[10/30/2024-13:47:22] [I] CUDA Graph: Disabled
[10/30/2024-13:47:22] [I] Separate profiling: Disabled
[10/30/2024-13:47:22] [I] Time Deserialize: Disabled
[10/30/2024-13:47:22] [I] Time Refit: Disabled
[10/30/2024-13:47:22] [I] Inputs:
[10/30/2024-13:47:22] [I] === Reporting Options ===
[10/30/2024-13:47:22] [I] Verbose: Disabled
[10/30/2024-13:47:22] [I] Averages: 10 inferences
[10/30/2024-13:47:22] [I] Percentile: 99
[10/30/2024-13:47:22] [I] Dump refittable layers:Disabled
[10/30/2024-13:47:22] [I] Dump output: Disabled
[10/30/2024-13:47:22] [I] Profile: Enabled
[10/30/2024-13:47:22] [I] Export timing to JSON file:
[10/30/2024-13:47:22] [I] Export output to JSON file:
[10/30/2024-13:47:22] [I] Export profile to JSON file:
[10/30/2024-13:47:22] [I]
[10/30/2024-13:47:22] [I] === Device Information ===
[10/30/2024-13:47:22] [I] Selected Device: Xavier
[10/30/2024-13:47:22] [I] Compute Capability: 7.2
[10/30/2024-13:47:22] [I] SMs: 8
[10/30/2024-13:47:22] [I] Compute Clock Rate: 1.377 GHz
[10/30/2024-13:47:22] [I] Device Global Memory: 31009 MiB
[10/30/2024-13:47:22] [I] Shared Memory per SM: 96 KiB
[10/30/2024-13:47:22] [I] Memory Bus Width: 256 bits (ECC disabled)
[10/30/2024-13:47:22] [I] Memory Clock Rate: 1.377 GHz
[10/30/2024-13:47:22] [I]
[10/30/2024-13:47:22] [I] TensorRT version: 8.4.1
[10/30/2024-13:47:22] [I] Engine loaded in 0.0142044 sec.
[10/30/2024-13:47:23] [I] [TRT] [MemUsageChange] Init CUDA: CPU +185, GPU +0, now: CPU 221, GPU 8284 (MiB)
[10/30/2024-13:47:23] [I] [TRT] Loaded engine size: 12 MiB
[10/30/2024-13:47:23] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +11, GPU +0, now: CPU 11, GPU 0 (MiB)
[10/30/2024-13:47:23] [I] Engine deserialized in 0.979467 sec.
[10/30/2024-13:47:23] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +14, now: CPU 11, GPU 14 (MiB)
[10/30/2024-13:47:23] [I] Using random values for input images
[10/30/2024-13:47:23] [I] Created input binding for images with dimensions 1x3x1408x512
[10/30/2024-13:47:23] [I] Using random values for output output0
[10/30/2024-13:47:23] [I] Created output binding for output0 with dimensions 1x84x14784
[10/30/2024-13:47:23] [I] Starting inference
[10/30/2024-13:47:27] [I] The e2e network timing is not reported since it is inaccurate due to the extra synchronizations when the profiler is enabled.
[10/30/2024-13:47:27] [I] To show e2e network timing report, add --separateProfileRun to profile layer timing in a separate run or remove --dumpProfile to disable the profiler.
[10/30/2024-13:47:27] [I]
[10/30/2024-13:47:27] [I] === Profile (46 iterations ) ===
[10/30/2024-13:47:27] [I] Layer Time (ms) Avg. Time (ms) Median Time (ms) Time %
[10/30/2024-13:47:27] [I] images to nvm 9.16 0.1991 0.1988 0.3
[10/30/2024-13:47:27] [I] {ForeignNode[/model.0/conv/Conv…/model.22/Concat_2]} 11.00 0.2392 0.1850 0.3
[10/30/2024-13:47:27] [I] Reformatting CopyNode for Output Tensor 2 to {ForeignNode[/model.0/conv/Conv…/model.22/Concat_2]} 3295.70 71.6457 71.5757 98.1
[10/30/2024-13:47:27] [I] images copy finish 0.11 0.0024 0.0023 0.0
[10/30/2024-13:47:27] [I] Reformatted Output Tensor 2 to {ForeignNode[/model.0/conv/Conv…/model.22/Concat_2]} finish 0.11 0.0023 0.0023 0.0
[10/30/2024-13:47:27] [I] /model.22/Reshape 8.14 0.1769 0.1763 0.2
[10/30/2024-13:47:27] [I] /model.22/Concat_output_0 finish 0.17 0.0037 0.0036 0.0
[10/30/2024-13:47:27] [I] /model.22/Reshape_copy_output 3.88 0.0844 0.0842 0.1
[10/30/2024-13:47:27] [I] Reformatting CopyNode for Input Tensor 0 to /model.22/Reshape_1 1.88 0.0408 0.0407 0.1
[10/30/2024-13:47:27] [I] /model.22/Concat_1_output_0 finish 0.18 0.0038 0.0038 0.0
[10/30/2024-13:47:27] [I] /model.22/Reshape_1_copy_output 1.29 0.0280 0.0279 0.0
[10/30/2024-13:47:27] [I] /model.22/Reshape_2_copy_output 0.73 0.0159 0.0156 0.0
[10/30/2024-13:47:27] [I] /model.22/dfl/Reshape + /model.22/dfl/Transpose 4.52 0.0982 0.0979 0.1
[10/30/2024-13:47:27] [I] Reformatting CopyNode for Input Tensor 0 to /model.22/dfl/Softmax 3.34 0.0725 0.0722 0.1
[10/30/2024-13:47:27] [I] /model.22/dfl/Softmax 2.34 0.0510 0.0510 0.1
[10/30/2024-13:47:27] [I] Reformatting CopyNode for Input Tensor 0 to /model.22/dfl/conv/Conv 3.27 0.0710 0.0709 0.1
[10/30/2024-13:47:27] [I] /model.22/dfl/conv/Conv 2.75 0.0599 0.0599 0.1
[10/30/2024-13:47:27] [I] /model.22/dfl/Reshape_1 0.46 0.0099 0.0098 0.0
[10/30/2024-13:47:27] [I] Reformatting CopyNode for Input Tensor 1 to scale_eltwise_of_/model.22/Sub 0.43 0.0093 0.0092 0.0
[10/30/2024-13:47:27] [I] scale_eltwise_of_/model.22/Sub 0.56 0.0122 0.0121 0.0
[10/30/2024-13:47:27] [I] Reformatting CopyNode for Input Tensor 0 to /model.22/Sub 0.36 0.0078 0.0078 0.0
[10/30/2024-13:47:27] [I] /model.22/Sub 0.59 0.0128 0.0127 0.0
[10/30/2024-13:47:27] [I] Reformatting CopyNode for Input Tensor 0 to /model.22/Add_1 0.28 0.0062 0.0061 0.0
[10/30/2024-13:47:27] [I] /model.22/Add_1 0.45 0.0099 0.0097 0.0
[10/30/2024-13:47:27] [I] /model.22/Add_2 0.40 0.0086 0.0086 0.0
[10/30/2024-13:47:27] [I] Reformatting CopyNode for Input Tensor 1 to scale_eltwise_of_/model.22/Sub_1 0.34 0.0075 0.0072 0.0
[10/30/2024-13:47:27] [I] scale_eltwise_of_/model.22/Sub_1 0.41 0.0089 0.0087 0.0
[10/30/2024-13:47:27] [I] /model.22/Sub_1 0.41 0.0089 0.0087 0.0
[10/30/2024-13:47:27] [I] Reformatting CopyNode for Output Tensor 0 to /model.22/Sub_1 0.41 0.0089 0.0089 0.0
[10/30/2024-13:47:27] [I] Reformatting CopyNode for Input Tensor 0 to /model.22/Div_1 0.33 0.0072 0.0071 0.0
[10/30/2024-13:47:27] [I] /model.22/Div_1 0.35 0.0076 0.0076 0.0
[10/30/2024-13:47:27] [I] /model.22/Div_1_output_0 copy 0.30 0.0066 0.0065 0.0
[10/30/2024-13:47:27] [I] /model.22/Mul_2 0.57 0.0124 0.0122 0.0
[10/30/2024-13:47:27] [I] PWN(/model.22/Sigmoid) 4.18 0.0909 0.0903 0.1
[10/30/2024-13:47:27] [I] Total 3359.38 73.0301 72.9173 100.0
[10/30/2024-13:47:27] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8401] # trtexec --loadEngine=yolov8s_dla_b1_int8.engine --useDLACore=1 --dumpProfile

I wonder why there is a layer “Reformatting CopyNode for Output Tensor 2 to {ForeignNode[/model.0/conv/Conv.../model.22/Concat_2]}” that accounts for almost all of the time (98.1%).
I have tried several YOLO models and see the same behavior in each case. Is this a memory problem, or a limitation of the device I am using?

I am sincerely looking forward to your advice, and I will provide more information as soon as possible if needed.

By the way, I used nsys to profile the engine:

root@miivii-tegra:/home/nvidia/workspace/v8/nvidia_example# nsys profile --trace=cuda,nvtx,cublas,cudla,cusparse,cudnn,nvmedia --output=dla /usr/src/tensorrt/bin/trtexec --loadEngine=yolov8s_dla_b1_int8.engine --iterations=10 --idleTime=500 --duration=0 --useSpinWait
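
For a quick text summary of the capture without opening the GUI, the stats subcommand should also work on the report file (the available report types vary by nsys version):

$ nsys stats dla.nsys-rep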

I loaded the resulting .nsys-rep file into Nsight Systems.

[Screenshot: Nsight Systems timeline of the trtexec run]
Looking at the last iteration, the whole inference takes about 4.496 ms.

The specific layer “Reformatting CopyNode for Output Tensor 2 to {ForeignNode[/model.0/conv/Conv.../model.22/Concat_2]}” takes only about 0.219 ms in the screenshot.

Why is there such a difference between the trtexec and nsys inference results?

Please see What does “Reformatting CopyNode for Input Tensor” mean in trtexec's dump profile · Issue #2136 · NVIDIA/TensorRT · GitHub if it helps to clarify. These nodes perform a data-format change for tensors that are fed into the DLA.
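
As an experiment, reformat nodes at the network bindings can sometimes be reduced by constraining the I/O formats at build time, for example (a sketch only; note that the large copy in your profile sits at the internal DLA-to-GPU boundary, which these flags do not control):

$ trtexec --onnx=yolov8s_1400_512_bs1.onnx --best --useDLACore=1 --allowGPUFallback --inputIOFormats=int8:dla_hwc4 --outputIOFormats=fp16:chw16 --saveEngine=yolov8s_dla_iofmt.engine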

The nsys profiler is expected to add some overhead, since it inserts additional hooks for profiling/tracing, so its timings cannot be directly compared with trtexec timings.
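
Also note that trtexec itself flags this in the log above: with --dumpProfile enabled, the extra synchronizations make the end-to-end timing inaccurate. To get an e2e number comparable to the Nsight timeline, time the layers in a separate run, e.g.:

$ trtexec --loadEngine=yolov8s_dla_b1_int8.engine --useDLACore=1 --dumpProfile --separateProfileRun --exportProfile=dla_profile.json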

Thanks for your answer. I will look into the difference.
