Please provide complete information as applicable to your setup.
• Hardware Platform (Jetson / GPU): Jetson (Xavier)
• DeepStream Version: 6.3
• JetPack Version (valid for Jetson only): 5.1.2
• TensorRT Version: 8.5.2
• NVIDIA GPU Driver Version (valid for GPU only): N/A
• Issue Type (questions, new requirements, bugs): question
I am using the DLA to improve the performance of plate recognition on Xavier.
There are some warnings when DeepStream generates the DLA engine; it appears the DLA does not support some of the layers. The log is as follows:
0:00:05.676002627 3730290 0xaaaac492a0c0 INFO nvinfer gstnvinfer.cpp:693:gst_nvinfer_logger:<secondary_gie_0> NvDsInferContext[UID 3]: Info from NvDsInferContextImpl::buildModel() <nvdsinfer_context_impl.cpp:2002> [UID = 3]: Trying to create engine from model files
WARNING: [TRT]: onnx2trt_utils.cpp:375: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
WARNING: DLA does not support FP32 precision type, using FP16 mode.
WARNING: [TRT]: Layer 'GlobalAveragePool_26' (REDUCE): Unsupported on DLA. Switching this layer's device type to GPU.
WARNING: [TRT]: Layer '(Unnamed Layer* 27) [Shape]' (SHAPE): DLA only supports FP16 and Int8 precision type. Switching this layer's device type to GPU.
WARNING: [TRT]: Layer '(Unnamed Layer* 28) [Constant]' (CONSTANT): Unsupported on DLA. Switching this layer's device type to GPU.
WARNING: [TRT]: Layer '(Unnamed Layer* 29) [Gather]' (GATHER): Unsupported on DLA. Switching this layer's device type to GPU.
WARNING: [TRT]: Layer '(Unnamed Layer* 30) [Constant]' (CONSTANT): Unsupported on DLA. Switching this layer's device type to GPU.
WARNING: [TRT]: (Unnamed Layer* 31) [Concatenation]: DLA only supports concatenation on the C dimension.
WARNING: [TRT]: Layer '(Unnamed Layer* 31) [Concatenation]' (CONCATENATION): Unsupported on DLA. Switching this layer's device type to GPU.
WARNING: [TRT]: Layer '(Unnamed Layer* 35) [Shape]' (SHAPE): DLA only supports FP16 and Int8 precision type. Switching this layer's device type to GPU.
WARNING: [TRT]: Layer '(Unnamed Layer* 36) [Constant]' (CONSTANT): Unsupported on DLA. Switching this layer's device type to GPU.
WARNING: [TRT]: Layer '(Unnamed Layer* 37) [Gather]' (GATHER): Unsupported on DLA. Switching this layer's device type to GPU.
WARNING: [TRT]: Layer 'Transpose_38' (SHUFFLE): Unsupported on DLA. Switching this layer's device type to GPU.
WARNING: [TRT]: DLA only supports shuffle with 4-D input and output
WARNING: [TRT]: Validation failed for DLA layer: node_of_64 + Transpose_38. Switching to GPU fallback.
WARNING: [TRT]: Splitting DLA subgraph at: node_of_64 + Transpose_38 because DLA validation failed for this layer.
WARNING: [TRT]: DLA only supports shuffle with 4-D input and output
WARNING: [TRT]: Validation failed for DLA layer: node_of_64 + Transpose_38. Switching to GPU fallback.
WARNING: [TRT]: DLA only supports shuffle with 4-D input and output
WARNING: [TRT]: Validation failed for DLA layer: Flatten_27. Switching to GPU fallback.
INFO: suggestedPathName: /home/mic-730ai/david/code/ecu-ds63/ds-app/ds-engine/plate_rec_color_small.onnx_b12_gpu0_dla0_fp16.engine
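For reference, the same DLA engine build can be reproduced outside DeepStream with trtexec, which makes it easier to see which layers fall back to the GPU. This is a sketch; the ONNX path is assumed to match the engine name above, and the input name/shape are taken from the bindings in the log below.

```shell
# Build the engine on DLA core 0 in FP16, allowing GPU fallback for the
# unsupported layers listed in the warnings above (sketch; adjust paths).
/usr/src/tensorrt/bin/trtexec \
  --onnx=plate_rec_color_small.onnx \
  --useDLACore=0 --fp16 --allowGPUFallback \
  --shapes=images:12x3x48x168 \
  --saveEngine=plate_rec_color_small.onnx_b12_gpu0_dla0_fp16.engine
```

Adding --verbose to this command prints the final device placement per layer, which shows how much of the network actually runs on the DLA versus the GPU.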
However, I found that performance did not improve. The logs are as follows.
With DLA:
$ /usr/src/tensorrt/bin/trtexec --loadEngine=./plate_rec_color_small.onnx_b12_gpu0_dla0_fp16.engine
&&&& RUNNING TensorRT.trtexec [TensorRT v8502] # /usr/src/tensorrt/bin/trtexec --loadEngine=./plate_rec_color_small.onnx_b12_gpu0_dla0_fp16.engine
[12/18/2024-10:43:18] [I]
[12/18/2024-10:43:18] [I] === Device Information ===
[12/18/2024-10:43:18] [I] Selected Device: Xavier
[12/18/2024-10:43:18] [I] Compute Capability: 7.2
[12/18/2024-10:43:18] [I] SMs: 8
[12/18/2024-10:43:18] [I] Compute Clock Rate: 1.377 GHz
[12/18/2024-10:43:18] [I] Device Global Memory: 30990 MiB
[12/18/2024-10:43:18] [I] Shared Memory per SM: 96 KiB
[12/18/2024-10:43:18] [I] Memory Bus Width: 256 bits (ECC disabled)
[12/18/2024-10:43:18] [I] Memory Clock Rate: 1.377 GHz
[12/18/2024-10:43:18] [I]
[12/18/2024-10:43:18] [I] TensorRT version: 8.5.2
[12/18/2024-10:43:18] [I] Engine loaded in 0.00463397 sec.
[12/18/2024-10:43:19] [I] [TRT] Loaded engine size: 1 MiB
[12/18/2024-10:43:19] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +1, GPU +0, now: CPU 1, GPU 0 (MiB)
[12/18/2024-10:43:19] [I] Engine deserialized in 0.919735 sec.
[12/18/2024-10:43:19] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +1, now: CPU 1, GPU 1 (MiB)
[12/18/2024-10:43:19] [I] Setting persistentCacheLimit to 0 bytes.
[12/18/2024-10:43:19] [I] Using random values for input images
[12/18/2024-10:43:19] [I] Created input binding for images with dimensions 12x3x48x168
[12/18/2024-10:43:19] [I] Using random values for output output_2
[12/18/2024-10:43:19] [I] Created output binding for output_2 with dimensions 12x5
[12/18/2024-10:43:19] [I] Using random values for output output_1
[12/18/2024-10:43:19] [I] Created output binding for output_1 with dimensions 12x21x78
[12/18/2024-10:43:19] [I] Starting inference
[12/18/2024-10:43:23] [I] Warmup completed 34 queries over 200 ms
[12/18/2024-10:43:23] [I] Timing trace has 501 queries over 3.01719 s
[12/18/2024-10:43:23] [I]
[12/18/2024-10:43:23] [I] === Trace details ===
[12/18/2024-10:43:23] [I] Trace averages of 10 runs:
[12/18/2024-10:43:23] [I] Average on 10 runs - GPU latency: 6.01292 ms - Host latency: 6.46045 ms (enqueue 0.42644 ms)
[12/18/2024-10:43:23] [I]
[12/18/2024-10:43:23] [I] === Performance summary ===
[12/18/2024-10:43:23] [I] Throughput: 166.049 qps
[12/18/2024-10:43:23] [I] Latency: min = 6.40771 ms, max = 6.54456 ms, mean = 6.45381 ms, median = 6.45264 ms, percentile(90%) = 6.4812 ms, percentile(95%) = 6.48999 ms, percentile(99%) = 6.52087 ms
[12/18/2024-10:43:23] [I] Enqueue Time: min = 0.347656 ms, max = 4.32959 ms, mean = 0.538269 ms, median = 0.468643 ms, percentile(90%) = 0.767822 ms, percentile(95%) = 0.829895 ms, percentile(99%) = 0.896973 ms
[12/18/2024-10:43:23] [I] H2D Latency: min = 0.390381 ms, max = 0.45166 ms, mean = 0.398451 ms, median = 0.397461 ms, percentile(90%) = 0.400146 ms, percentile(95%) = 0.403809 ms, percentile(99%) = 0.424194 ms
[12/18/2024-10:43:23] [I] GPU Compute Time: min = 5.96246 ms, max = 6.09521 ms, mean = 6.00524 ms, median = 6.00415 ms, percentile(90%) = 6.03296 ms, percentile(95%) = 6.04028 ms, percentile(99%) = 6.07178 ms
[12/18/2024-10:43:23] [I] D2H Latency: min = 0.0405273 ms, max = 0.0531006 ms, mean = 0.0501165 ms, median = 0.0501709 ms, percentile(90%) = 0.0513916 ms, percentile(95%) = 0.0517883 ms, percentile(99%) = 0.0525513 ms
[12/18/2024-10:43:23] [I] Total Host Walltime: 3.01719 s
[12/18/2024-10:43:23] [I] Total GPU Compute Time: 3.00863 s
[12/18/2024-10:43:23] [I] Explanations of the performance metrics are printed in the verbose logs.
[12/18/2024-10:43:23] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8502] # /usr/src/tensorrt/bin/trtexec --loadEngine=./plate_rec_color_small.onnx_b12_gpu0_dla0_fp16.engine
Without DLA:
$ /usr/src/tensorrt/bin/trtexec --loadEngine=./plate_rec_color_small.onnx_b12_gpu0_fp16.engine
&&&& RUNNING TensorRT.trtexec [TensorRT v8502] # /usr/src/tensorrt/bin/trtexec --loadEngine=./plate_rec_color_small.onnx_b12_gpu0_fp16.engine
[12/18/2024-10:43:27] [I] DLACore:
[12/18/2024-10:43:27] [I] Plugins:
[12/18/2024-10:43:27] [I] === Inference Options ===
[12/18/2024-10:43:27] [I] Batch: 1
[12/18/2024-10:43:27] [I] Input inference shapes: model
[12/18/2024-10:43:27] [I] Iterations: 10
[12/18/2024-10:43:27] [I] Duration: 3s (+ 200ms warm up)
[12/18/2024-10:43:27] [I] Sleep time: 0ms
[12/18/2024-10:43:27] [I] Idle time: 0ms
[12/18/2024-10:43:27] [I] Streams: 1
[12/18/2024-10:43:27] [I] ExposeDMA: Disabled
[12/18/2024-10:43:27] [I] Data transfers: Enabled
[12/18/2024-10:43:27] [I] Spin-wait: Disabled
[12/18/2024-10:43:27] [I] Multithreading: Disabled
[12/18/2024-10:43:27] [I] CUDA Graph: Disabled
[12/18/2024-10:43:27] [I] Separate profiling: Disabled
[12/18/2024-10:43:27] [I] Time Deserialize: Disabled
[12/18/2024-10:43:27] [I] Time Refit: Disabled
[12/18/2024-10:43:27] [I] NVTX verbosity: 0
[12/18/2024-10:43:27] [I] Persistent Cache Ratio: 0
[12/18/2024-10:43:27] [I] Inputs:
[12/18/2024-10:43:27] [I] === Reporting Options ===
[12/18/2024-10:43:27] [I] Verbose: Disabled
[12/18/2024-10:43:27] [I] Averages: 10 inferences
[12/18/2024-10:43:27] [I] Percentiles: 90,95,99
[12/18/2024-10:43:27] [I] Dump refittable layers:Disabled
[12/18/2024-10:43:27] [I] Dump output: Disabled
[12/18/2024-10:43:27] [I] Profile: Disabled
[12/18/2024-10:43:27] [I] Export timing to JSON file:
[12/18/2024-10:43:27] [I] Export output to JSON file:
[12/18/2024-10:43:27] [I] Export profile to JSON file:
[12/18/2024-10:43:27] [I]
[12/18/2024-10:43:27] [I] === Device Information ===
[12/18/2024-10:43:27] [I] Selected Device: Xavier
[12/18/2024-10:43:27] [I] Compute Capability: 7.2
[12/18/2024-10:43:27] [I] SMs: 8
[12/18/2024-10:43:27] [I] Compute Clock Rate: 1.377 GHz
[12/18/2024-10:43:27] [I] Device Global Memory: 30990 MiB
[12/18/2024-10:43:27] [I] Shared Memory per SM: 96 KiB
[12/18/2024-10:43:27] [I] Memory Bus Width: 256 bits (ECC disabled)
[12/18/2024-10:43:27] [I] Memory Clock Rate: 1.377 GHz
[12/18/2024-10:43:27] [I]
[12/18/2024-10:43:27] [I] TensorRT version: 8.5.2
[12/18/2024-10:43:27] [I] Engine loaded in 0.00244214 sec.
[12/18/2024-10:43:27] [I] [TRT] Loaded engine size: 0 MiB
[12/18/2024-10:43:28] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +0, now: CPU 0, GPU 0 (MiB)
[12/18/2024-10:43:28] [I] Engine deserialized in 0.947943 sec.
[12/18/2024-10:43:28] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +5, now: CPU 0, GPU 5 (MiB)
[12/18/2024-10:43:28] [I] Setting persistentCacheLimit to 0 bytes.
[12/18/2024-10:43:28] [W] Shape missing for input with dynamic shape: imagesAutomatically setting shape to: 1x3x48x168
[12/18/2024-10:43:28] [I] Using random values for input images
[12/18/2024-10:43:28] [I] Created input binding for images with dimensions 1x3x48x168
[12/18/2024-10:43:28] [I] Using random values for output output_2
[12/18/2024-10:43:28] [I] Created output binding for output_2 with dimensions 1x5
[12/18/2024-10:43:28] [I] Using random values for output output_1
[12/18/2024-10:43:28] [I] Created output binding for output_1 with dimensions 1x21x78
[12/18/2024-10:43:28] [I] Starting inference
[12/18/2024-10:43:31] [I] Warmup completed 146 queries over 200 ms
[12/18/2024-10:43:31] [I] Timing trace has 9516 queries over 3.00131 s
[12/18/2024-10:43:31] [I]
[12/18/2024-10:43:31] [I] === Trace details ===
[12/18/2024-10:43:31] [I] Trace averages of 10 runs:
[12/18/2024-10:43:31] [I] Average on 10 runs - GPU latency: 0.314313 ms - Host latency: 0.328241 ms (enqueue 0.192389 ms)
[12/18/2024-10:43:31] [I] Average on 10 runs - GPU latency: 0.314954 ms - Host latency: 0.330383 ms (enqueue 0.196613 ms)
[12/18/2024-10:43:31] [I] Average on 10 runs - GPU latency: 0.313416 ms - Host latency: 0.326752 ms (enqueue 0.191602 ms)
[12/18/2024-10:43:31] [I] Average on 10 runs - GPU latency: 0.31377 ms - Host latency: 0.326819 ms (enqueue 0.191174 ms)
[12/18/2024-10:43:31] [I] Average on 10 runs - GPU latency: 0.313239 ms - Host latency: 0.327753 ms (enqueue 0.19267 ms)
[12/18/2024-10:43:31] [I] Average on 10 runs - GPU latency: 0.31192 ms - Host latency: 0.325085 ms (enqueue 0.200092 ms)
[12/18/2024-10:43:31] [I] Average on 10 runs - GPU latency: 0.312909 ms - Host latency: 0.325946 ms (enqueue 0.196893 ms)
[12/18/2024-10:43:31] [I] Average on 10 runs - GPU latency: 0.312524 ms - Host latency: 0.325525 ms (enqueue 0.191577 ms)
[12/18/2024-10:43:31] [I] Average on 10 runs - GPU latency: 0.312085 ms - Host latency: 0.325183 ms (enqueue 0.196155 ms)
[12/18/2024-10:43:31] [I] Average on 10 runs - GPU latency: 0.313843 ms - Host latency: 0.327246 ms (enqueue 0.194348 ms)
[12/18/2024-10:43:31] [I] Average on 10 runs - GPU latency: 0.312634 ms - Host latency: 0.326648 ms (enqueue 0.193518 ms)
[12/18/2024-10:43:31] [I] Average on 10 runs - GPU latency: 0.313354 ms - Host latency: 0.326318 ms (enqueue 0.193164 ms)
[12/18/2024-10:43:31] [I]
[12/18/2024-10:43:31] [I] === Performance summary ===
[12/18/2024-10:43:31] [I] Throughput: 3170.61 qps
[12/18/2024-10:43:31] [I] Latency: min = 0.319092 ms, max = 0.653885 ms, mean = 0.328242 ms, median = 0.325928 ms, percentile(90%) = 0.3302 ms, percentile(95%) = 0.34082 ms, percentile(99%) = 0.359589 ms
[12/18/2024-10:43:31] [I] Enqueue Time: min = 0.179443 ms, max = 0.385986 ms, mean = 0.192961 ms, median = 0.188828 ms, percentile(90%) = 0.203705 ms, percentile(95%) = 0.2229 ms, percentile(99%) = 0.245117 ms
[12/18/2024-10:43:31] [I] H2D Latency: min = 0.00634766 ms, max = 0.0775757 ms, mean = 0.00997222 ms, median = 0.00952148 ms, percentile(90%) = 0.010376 ms, percentile(95%) = 0.0112305 ms, percentile(99%) = 0.026123 ms
[12/18/2024-10:43:31] [I] GPU Compute Time: min = 0.306396 ms, max = 0.634872 ms, mean = 0.314492 ms, median = 0.312622 ms, percentile(90%) = 0.315918 ms, percentile(95%) = 0.317505 ms, percentile(99%) = 0.345062 ms
[12/18/2024-10:43:31] [I] D2H Latency: min = 0.00219727 ms, max = 0.00866699 ms, mean = 0.00377876 ms, median = 0.00384521 ms, percentile(90%) = 0.00463867 ms, percentile(95%) = 0.00476074 ms, percentile(99%) = 0.00512695 ms
[12/18/2024-10:43:31] [I] Total Host Walltime: 3.00131 s
[12/18/2024-10:43:31] [I] Total GPU Compute Time: 2.9927 s
[12/18/2024-10:43:31] [W] * GPU compute time is unstable, with coefficient of variance = 4.96875%.
[12/18/2024-10:43:31] [W] If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[12/18/2024-10:43:31] [I] Explanations of the performance metrics are printed in the verbose logs.
[12/18/2024-10:43:31] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8502] # /usr/src/tensorrt/bin/trtexec --loadEngine=./plate_rec_color_small.onnx_b12_gpu0_fp16.engine
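As a side note when comparing the two summaries above: the two runs use different batch sizes (the DLA engine is fixed at batch 12, while the GPU engine's dynamic shape was auto-set to 1), so the raw qps figures are not directly comparable. A quick normalization using the numbers from the logs:

```python
# Throughput figures copied from the two trtexec performance summaries above.
dla_qps, dla_batch = 166.049, 12    # DLA engine: batch fixed at 12
gpu_qps, gpu_batch = 3170.61, 1     # GPU engine: dynamic shape auto-set to 1

dla_imgs_per_sec = dla_qps * dla_batch
gpu_imgs_per_sec = gpu_qps * gpu_batch

print(f"DLA engine: {dla_imgs_per_sec:.0f} images/s")  # ~1993 images/s
print(f"GPU engine: {gpu_imgs_per_sec:.0f} images/s")  # ~3171 images/s
```

Even after normalizing by batch size, the DLA build is slower, which is consistent with the many GPU-fallback warnings during the engine build.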
Why does the DLA not improve performance, and how can I fix this?