Module: NVIDIA Jetson AGX Xavier (32 GB RAM)
CUDA: 11.4.239
cuDNN: 8.4.1.50
TensorRT: 8.4.1.5
JetPack: 5.0.2
I am trying to use trtexec to convert a YOLOv8 ONNX model to a TensorRT engine and run inference on the DLA.
The experiment follows https://forums.developer.nvidia.com/t/dla-performance/302939
I downloaded the ONNX model “yolov8s_1400_512_bs1.onnx” from the URL above.
In the terminal:
root@miivii-tegra:/home/nvidia/workspace/v8/nvidia_example# trtexec --onnx=yolov8s_1400_512_bs1.onnx --int8 --fp16 --best --useDLACore=1 --allowGPUFallback --saveEngine=yolov8s_dla_b1_int8.engine --verbose > test.log
test.log (1.8 MB)
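To check how much of the network actually stayed on the DLA, the verbose build log can be grepped for TensorRT's GPU-fallback warnings (the wording below is assumed from typical TensorRT logs and may vary between versions):

$ grep -i "falling back to GPU" test.log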
root@miivii-tegra:/home/nvidia/workspace/v8/nvidia_example# trtexec --loadEngine=yolov8s_dla_b1_int8.engine --useDLACore=1 --dumpProfile
&&&& RUNNING TensorRT.trtexec [TensorRT v8401] # trtexec --loadEngine=yolov8s_dla_b1_int8.engine --useDLACore=1 --dumpProfile
[10/30/2024-13:47:22] [I] === Model Options ===
[10/30/2024-13:47:22] [I] Format: *
[10/30/2024-13:47:22] [I] Model:
[10/30/2024-13:47:22] [I] Output:
[10/30/2024-13:47:22] [I] === Build Options ===
[10/30/2024-13:47:22] [I] Max batch: 1
[10/30/2024-13:47:22] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[10/30/2024-13:47:22] [I] minTiming: 1
[10/30/2024-13:47:22] [I] avgTiming: 8
[10/30/2024-13:47:22] [I] Precision: FP32
[10/30/2024-13:47:22] [I] LayerPrecisions:
[10/30/2024-13:47:22] [I] Calibration:
[10/30/2024-13:47:22] [I] Refit: Disabled
[10/30/2024-13:47:22] [I] Sparsity: Disabled
[10/30/2024-13:47:22] [I] Safe mode: Disabled
[10/30/2024-13:47:22] [I] DirectIO mode: Disabled
[10/30/2024-13:47:22] [I] Restricted mode: Disabled
[10/30/2024-13:47:22] [I] Build only: Disabled
[10/30/2024-13:47:22] [I] Save engine:
[10/30/2024-13:47:22] [I] Load engine: yolov8s_dla_b1_int8.engine
[10/30/2024-13:47:22] [I] Profiling verbosity: 0
[10/30/2024-13:47:22] [I] Tactic sources: Using default tactic sources
[10/30/2024-13:47:22] [I] timingCacheMode: local
[10/30/2024-13:47:22] [I] timingCacheFile:
[10/30/2024-13:47:22] [I] Input(s)s format: fp32:CHW
[10/30/2024-13:47:22] [I] Output(s)s format: fp32:CHW
[10/30/2024-13:47:22] [I] Input build shapes: model
[10/30/2024-13:47:22] [I] Input calibration shapes: model
[10/30/2024-13:47:22] [I] === System Options ===
[10/30/2024-13:47:22] [I] Device: 0
[10/30/2024-13:47:22] [I] DLACore: 1
[10/30/2024-13:47:22] [I] Plugins:
[10/30/2024-13:47:22] [I] === Inference Options ===
[10/30/2024-13:47:22] [I] Batch: 1
[10/30/2024-13:47:22] [I] Input inference shapes: model
[10/30/2024-13:47:22] [I] Iterations: 10
[10/30/2024-13:47:22] [I] Duration: 3s (+ 200ms warm up)
[10/30/2024-13:47:22] [I] Sleep time: 0ms
[10/30/2024-13:47:22] [I] Idle time: 0ms
[10/30/2024-13:47:22] [I] Streams: 1
[10/30/2024-13:47:22] [I] ExposeDMA: Disabled
[10/30/2024-13:47:22] [I] Data transfers: Enabled
[10/30/2024-13:47:22] [I] Spin-wait: Disabled
[10/30/2024-13:47:22] [I] Multithreading: Disabled
[10/30/2024-13:47:22] [I] CUDA Graph: Disabled
[10/30/2024-13:47:22] [I] Separate profiling: Disabled
[10/30/2024-13:47:22] [I] Time Deserialize: Disabled
[10/30/2024-13:47:22] [I] Time Refit: Disabled
[10/30/2024-13:47:22] [I] Inputs:
[10/30/2024-13:47:22] [I] === Reporting Options ===
[10/30/2024-13:47:22] [I] Verbose: Disabled
[10/30/2024-13:47:22] [I] Averages: 10 inferences
[10/30/2024-13:47:22] [I] Percentile: 99
[10/30/2024-13:47:22] [I] Dump refittable layers:Disabled
[10/30/2024-13:47:22] [I] Dump output: Disabled
[10/30/2024-13:47:22] [I] Profile: Enabled
[10/30/2024-13:47:22] [I] Export timing to JSON file:
[10/30/2024-13:47:22] [I] Export output to JSON file:
[10/30/2024-13:47:22] [I] Export profile to JSON file:
[10/30/2024-13:47:22] [I]
[10/30/2024-13:47:22] [I] === Device Information ===
[10/30/2024-13:47:22] [I] Selected Device: Xavier
[10/30/2024-13:47:22] [I] Compute Capability: 7.2
[10/30/2024-13:47:22] [I] SMs: 8
[10/30/2024-13:47:22] [I] Compute Clock Rate: 1.377 GHz
[10/30/2024-13:47:22] [I] Device Global Memory: 31009 MiB
[10/30/2024-13:47:22] [I] Shared Memory per SM: 96 KiB
[10/30/2024-13:47:22] [I] Memory Bus Width: 256 bits (ECC disabled)
[10/30/2024-13:47:22] [I] Memory Clock Rate: 1.377 GHz
[10/30/2024-13:47:22] [I]
[10/30/2024-13:47:22] [I] TensorRT version: 8.4.1
[10/30/2024-13:47:22] [I] Engine loaded in 0.0142044 sec.
[10/30/2024-13:47:23] [I] [TRT] [MemUsageChange] Init CUDA: CPU +185, GPU +0, now: CPU 221, GPU 8284 (MiB)
[10/30/2024-13:47:23] [I] [TRT] Loaded engine size: 12 MiB
[10/30/2024-13:47:23] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +11, GPU +0, now: CPU 11, GPU 0 (MiB)
[10/30/2024-13:47:23] [I] Engine deserialized in 0.979467 sec.
[10/30/2024-13:47:23] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +14, now: CPU 11, GPU 14 (MiB)
[10/30/2024-13:47:23] [I] Using random values for input images
[10/30/2024-13:47:23] [I] Created input binding for images with dimensions 1x3x1408x512
[10/30/2024-13:47:23] [I] Using random values for output output0
[10/30/2024-13:47:23] [I] Created output binding for output0 with dimensions 1x84x14784
[10/30/2024-13:47:23] [I] Starting inference
[10/30/2024-13:47:27] [I] The e2e network timing is not reported since it is inaccurate due to the extra synchronizations when the profiler is enabled.
[10/30/2024-13:47:27] [I] To show e2e network timing report, add --separateProfileRun to profile layer timing in a separate run or remove --dumpProfile to disable the profiler.
[10/30/2024-13:47:27] [I]
[10/30/2024-13:47:27] [I] === Profile (46 iterations ) ===
[10/30/2024-13:47:27] [I] Layer Time (ms) Avg. Time (ms) Median Time (ms) Time %
[10/30/2024-13:47:27] [I] images to nvm 9.16 0.1991 0.1988 0.3
[10/30/2024-13:47:27] [I] {ForeignNode[/model.0/conv/Conv…/model.22/Concat_2]} 11.00 0.2392 0.1850 0.3
[10/30/2024-13:47:27] [I] Reformatting CopyNode for Output Tensor 2 to {ForeignNode[/model.0/conv/Conv…/model.22/Concat_2]} 3295.70 71.6457 71.5757 98.1
[10/30/2024-13:47:27] [I] images copy finish 0.11 0.0024 0.0023 0.0
[10/30/2024-13:47:27] [I] Reformatted Output Tensor 2 to {ForeignNode[/model.0/conv/Conv…/model.22/Concat_2]} finish 0.11 0.0023 0.0023 0.0
[10/30/2024-13:47:27] [I] /model.22/Reshape 8.14 0.1769 0.1763 0.2
[10/30/2024-13:47:27] [I] /model.22/Concat_output_0 finish 0.17 0.0037 0.0036 0.0
[10/30/2024-13:47:27] [I] /model.22/Reshape_copy_output 3.88 0.0844 0.0842 0.1
[10/30/2024-13:47:27] [I] Reformatting CopyNode for Input Tensor 0 to /model.22/Reshape_1 1.88 0.0408 0.0407 0.1
[10/30/2024-13:47:27] [I] /model.22/Concat_1_output_0 finish 0.18 0.0038 0.0038 0.0
[10/30/2024-13:47:27] [I] /model.22/Reshape_1_copy_output 1.29 0.0280 0.0279 0.0
[10/30/2024-13:47:27] [I] /model.22/Reshape_2_copy_output 0.73 0.0159 0.0156 0.0
[10/30/2024-13:47:27] [I] /model.22/dfl/Reshape + /model.22/dfl/Transpose 4.52 0.0982 0.0979 0.1
[10/30/2024-13:47:27] [I] Reformatting CopyNode for Input Tensor 0 to /model.22/dfl/Softmax 3.34 0.0725 0.0722 0.1
[10/30/2024-13:47:27] [I] /model.22/dfl/Softmax 2.34 0.0510 0.0510 0.1
[10/30/2024-13:47:27] [I] Reformatting CopyNode for Input Tensor 0 to /model.22/dfl/conv/Conv 3.27 0.0710 0.0709 0.1
[10/30/2024-13:47:27] [I] /model.22/dfl/conv/Conv 2.75 0.0599 0.0599 0.1
[10/30/2024-13:47:27] [I] /model.22/dfl/Reshape_1 0.46 0.0099 0.0098 0.0
[10/30/2024-13:47:27] [I] Reformatting CopyNode for Input Tensor 1 to scale_eltwise_of_/model.22/Sub 0.43 0.0093 0.0092 0.0
[10/30/2024-13:47:27] [I] scale_eltwise_of_/model.22/Sub 0.56 0.0122 0.0121 0.0
[10/30/2024-13:47:27] [I] Reformatting CopyNode for Input Tensor 0 to /model.22/Sub 0.36 0.0078 0.0078 0.0
[10/30/2024-13:47:27] [I] /model.22/Sub 0.59 0.0128 0.0127 0.0
[10/30/2024-13:47:27] [I] Reformatting CopyNode for Input Tensor 0 to /model.22/Add_1 0.28 0.0062 0.0061 0.0
[10/30/2024-13:47:27] [I] /model.22/Add_1 0.45 0.0099 0.0097 0.0
[10/30/2024-13:47:27] [I] /model.22/Add_2 0.40 0.0086 0.0086 0.0
[10/30/2024-13:47:27] [I] Reformatting CopyNode for Input Tensor 1 to scale_eltwise_of_/model.22/Sub_1 0.34 0.0075 0.0072 0.0
[10/30/2024-13:47:27] [I] scale_eltwise_of_/model.22/Sub_1 0.41 0.0089 0.0087 0.0
[10/30/2024-13:47:27] [I] /model.22/Sub_1 0.41 0.0089 0.0087 0.0
[10/30/2024-13:47:27] [I] Reformatting CopyNode for Output Tensor 0 to /model.22/Sub_1 0.41 0.0089 0.0089 0.0
[10/30/2024-13:47:27] [I] Reformatting CopyNode for Input Tensor 0 to /model.22/Div_1 0.33 0.0072 0.0071 0.0
[10/30/2024-13:47:27] [I] /model.22/Div_1 0.35 0.0076 0.0076 0.0
[10/30/2024-13:47:27] [I] /model.22/Div_1_output_0 copy 0.30 0.0066 0.0065 0.0
[10/30/2024-13:47:27] [I] /model.22/Mul_2 0.57 0.0124 0.0122 0.0
[10/30/2024-13:47:27] [I] PWN(/model.22/Sigmoid) 4.18 0.0909 0.0903 0.1
[10/30/2024-13:47:27] [I] Total 3359.38 73.0301 72.9173 100.0
[10/30/2024-13:47:27] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8401] # trtexec --loadEngine=yolov8s_dla_b1_int8.engine --useDLACore=1 --dumpProfile
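Following the hint in the log above, the profiler can also be run in a separate pass and the per-layer numbers exported to JSON for easier sorting. The trtexec flags below match this build's reporting options; the jq line assumes the exported JSON is an array whose layer entries carry name/averageMs/percentage keys, which may differ between TensorRT versions:

$ trtexec --loadEngine=yolov8s_dla_b1_int8.engine --useDLACore=1 --dumpProfile --separateProfileRun --exportProfile=profile.json
$ jq -r 'map(select(.name != null)) | sort_by(-.percentage) | .[0:5][] | "\(.percentage)%  \(.name)"' profile.json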
I wonder why there is a layer named “Reformatting CopyNode for Output Tensor 2 to {ForeignNode[/model.0/conv/Conv.../model.22/Concat_2]}”, which accounts for 98.1% of the total time. If I read the profile correctly, this is the copy that reformats the output of the DLA subgraph (the ForeignNode) for the GPU layers that follow it.
I have tried several YOLO models and see the same behavior each time. Is this a memory problem, or a limitation of the device I am using?
I sincerely look forward to your advice, and I will provide more information as soon as it is needed.
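In case the board's power mode or clock settings are relevant to these numbers, I can also re-test after querying the power model and locking the clocks with the standard Jetson tools:

$ sudo nvpmodel -q
$ sudo jetson_clocks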