Throughput does not increase when using DLA on Xavier

Please provide complete information as applicable to your setup.

• Hardware Platform (Jetson / GPU) Jetson
• DeepStream Version 6.3
• JetPack Version (valid for Jetson only) 5.1.2
• TensorRT Version 8.5.2
• NVIDIA GPU Driver Version (valid for GPU only)
• Issue Type( questions, new requirements, bugs)
• How to reproduce the issue ? (This is for bugs. Including which sample app is using, the configuration files content, the command line used and other details for reproducing)
• Requirement details( This is for new requirement. Including the module name-for which plugin or for which sample application, the function description)

I am using the DLA to improve the performance of plate recognition on Xavier.

There are some warnings when DeepStream generates the DLA engine; I think the DLA does not support some layers. The log is as follows:

0:00:05.676002627 3730290 0xaaaac492a0c0 INFO                 nvinfer gstnvinfer.cpp:693:gst_nvinfer_logger:<secondary_gie_0> NvDsInferContext[UID 3]: Info from NvDsInferContextImpl::buildModel() <nvdsinfer_context_impl.cpp:2002> [UID = 3]: Trying to create engine from model files
WARNING: [TRT]: onnx2trt_utils.cpp:375: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
WARNING: DLA does not support FP32 precision type, using FP16 mode.
WARNING: [TRT]: Layer 'GlobalAveragePool_26' (REDUCE): Unsupported on DLA. Switching this layer's device type to GPU.
WARNING: [TRT]: Layer '(Unnamed Layer* 27) [Shape]' (SHAPE): DLA only supports FP16 and Int8 precision type. Switching this layer's device type to GPU.
WARNING: [TRT]: Layer '(Unnamed Layer* 28) [Constant]' (CONSTANT): Unsupported on DLA. Switching this layer's device type to GPU.
WARNING: [TRT]: Layer '(Unnamed Layer* 29) [Gather]' (GATHER): Unsupported on DLA. Switching this layer's device type to GPU.
WARNING: [TRT]: Layer '(Unnamed Layer* 30) [Constant]' (CONSTANT): Unsupported on DLA. Switching this layer's device type to GPU.
WARNING: [TRT]: (Unnamed Layer* 31) [Concatenation]: DLA only supports concatenation on the C dimension.
WARNING: [TRT]: Layer '(Unnamed Layer* 31) [Concatenation]' (CONCATENATION): Unsupported on DLA. Switching this layer's device type to GPU.
WARNING: [TRT]: Layer '(Unnamed Layer* 35) [Shape]' (SHAPE): DLA only supports FP16 and Int8 precision type. Switching this layer's device type to GPU.
WARNING: [TRT]: Layer '(Unnamed Layer* 36) [Constant]' (CONSTANT): Unsupported on DLA. Switching this layer's device type to GPU.
WARNING: [TRT]: Layer '(Unnamed Layer* 37) [Gather]' (GATHER): Unsupported on DLA. Switching this layer's device type to GPU.
WARNING: [TRT]: Layer 'Transpose_38' (SHUFFLE): Unsupported on DLA. Switching this layer's device type to GPU.
WARNING: [TRT]: DLA only supports shuffle with 4-D input and output
WARNING: [TRT]: Validation failed for DLA layer: node_of_64 + Transpose_38. Switching to GPU fallback.
WARNING: [TRT]: Splitting DLA subgraph at: node_of_64 + Transpose_38 because DLA validation failed for this layer.
WARNING: [TRT]: DLA only supports shuffle with 4-D input and output
WARNING: [TRT]: Validation failed for DLA layer: node_of_64 + Transpose_38. Switching to GPU fallback.
WARNING: [TRT]: DLA only supports shuffle with 4-D input and output
WARNING: [TRT]: Validation failed for DLA layer: Flatten_27. Switching to GPU fallback.
INFO: suggestedPathName: /home/mic-730ai/david/code/ecu-ds63/ds-app/ds-engine/plate_rec_color_small.onnx_b12_gpu0_dla0_fp16.engine
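The warnings above can be summarized programmatically. A minimal sketch (pure Python; the sample lines below are copied from the build log above, and in practice you would read the full log from a file) that counts how many layers TensorRT moved off the DLA:

```python
# Count DLA -> GPU fallbacks reported in a TensorRT build log.
# The sample lines are taken from the build log above; normally you
# would load the complete log from a file instead.
import re

log = """\
WARNING: [TRT]: Layer 'GlobalAveragePool_26' (REDUCE): Unsupported on DLA. Switching this layer's device type to GPU.
WARNING: [TRT]: Layer '(Unnamed Layer* 29) [Gather]' (GATHER): Unsupported on DLA. Switching this layer's device type to GPU.
WARNING: [TRT]: Layer 'Transpose_38' (SHUFFLE): Unsupported on DLA. Switching this layer's device type to GPU.
WARNING: [TRT]: Validation failed for DLA layer: Flatten_27. Switching to GPU fallback.
"""

# Both phrasings that TensorRT uses for a fallback are matched.
pattern = re.compile(r"Switching (?:this layer's device type to GPU|to GPU fallback)")
fallbacks = [line for line in log.splitlines() if pattern.search(line)]
print(f"{len(fallbacks)} layers fell back to the GPU")
```

If most of a small model's layers appear in such a list, the DLA subgraphs become tiny and the engine spends its time moving data between devices rather than computing.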

However, I found that the performance did not increase. The logs are as follows.

Using DLA:

$ /usr/src/tensorrt/bin/trtexec --loadEngine=./plate_rec_color_small.onnx_b12_gpu0_dla0_fp16.engine
&&&& RUNNING TensorRT.trtexec [TensorRT v8502] # /usr/src/tensorrt/bin/trtexec --loadEngine=./plate_rec_color_small.onnx_b12_gpu0_dla0_fp16.engine
[12/18/2024-10:43:18] [I] 
[12/18/2024-10:43:18] [I] === Device Information ===
[12/18/2024-10:43:18] [I] Selected Device: Xavier
[12/18/2024-10:43:18] [I] Compute Capability: 7.2
[12/18/2024-10:43:18] [I] SMs: 8
[12/18/2024-10:43:18] [I] Compute Clock Rate: 1.377 GHz
[12/18/2024-10:43:18] [I] Device Global Memory: 30990 MiB
[12/18/2024-10:43:18] [I] Shared Memory per SM: 96 KiB
[12/18/2024-10:43:18] [I] Memory Bus Width: 256 bits (ECC disabled)
[12/18/2024-10:43:18] [I] Memory Clock Rate: 1.377 GHz
[12/18/2024-10:43:18] [I] 
[12/18/2024-10:43:18] [I] TensorRT version: 8.5.2
[12/18/2024-10:43:18] [I] Engine loaded in 0.00463397 sec.
[12/18/2024-10:43:19] [I] [TRT] Loaded engine size: 1 MiB
[12/18/2024-10:43:19] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +1, GPU +0, now: CPU 1, GPU 0 (MiB)
[12/18/2024-10:43:19] [I] Engine deserialized in 0.919735 sec.
[12/18/2024-10:43:19] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +1, now: CPU 1, GPU 1 (MiB)
[12/18/2024-10:43:19] [I] Setting persistentCacheLimit to 0 bytes.
[12/18/2024-10:43:19] [I] Using random values for input images
[12/18/2024-10:43:19] [I] Created input binding for images with dimensions 12x3x48x168
[12/18/2024-10:43:19] [I] Using random values for output output_2
[12/18/2024-10:43:19] [I] Created output binding for output_2 with dimensions 12x5
[12/18/2024-10:43:19] [I] Using random values for output output_1
[12/18/2024-10:43:19] [I] Created output binding for output_1 with dimensions 12x21x78
[12/18/2024-10:43:19] [I] Starting inference
[12/18/2024-10:43:23] [I] Warmup completed 34 queries over 200 ms
[12/18/2024-10:43:23] [I] Timing trace has 501 queries over 3.01719 s
[12/18/2024-10:43:23] [I] 
[12/18/2024-10:43:23] [I] === Trace details ===
[12/18/2024-10:43:23] [I] Trace averages of 10 runs:
[12/18/2024-10:43:23] [I] Average on 10 runs - GPU latency: 6.01292 ms - Host latency: 6.46045 ms (enqueue 0.42644 ms)
[12/18/2024-10:43:23] [I] 
[12/18/2024-10:43:23] [I] === Performance summary ===
[12/18/2024-10:43:23] [I] Throughput: 166.049 qps
[12/18/2024-10:43:23] [I] Latency: min = 6.40771 ms, max = 6.54456 ms, mean = 6.45381 ms, median = 6.45264 ms, percentile(90%) = 6.4812 ms, percentile(95%) = 6.48999 ms, percentile(99%) = 6.52087 ms
[12/18/2024-10:43:23] [I] Enqueue Time: min = 0.347656 ms, max = 4.32959 ms, mean = 0.538269 ms, median = 0.468643 ms, percentile(90%) = 0.767822 ms, percentile(95%) = 0.829895 ms, percentile(99%) = 0.896973 ms
[12/18/2024-10:43:23] [I] H2D Latency: min = 0.390381 ms, max = 0.45166 ms, mean = 0.398451 ms, median = 0.397461 ms, percentile(90%) = 0.400146 ms, percentile(95%) = 0.403809 ms, percentile(99%) = 0.424194 ms
[12/18/2024-10:43:23] [I] GPU Compute Time: min = 5.96246 ms, max = 6.09521 ms, mean = 6.00524 ms, median = 6.00415 ms, percentile(90%) = 6.03296 ms, percentile(95%) = 6.04028 ms, percentile(99%) = 6.07178 ms
[12/18/2024-10:43:23] [I] D2H Latency: min = 0.0405273 ms, max = 0.0531006 ms, mean = 0.0501165 ms, median = 0.0501709 ms, percentile(90%) = 0.0513916 ms, percentile(95%) = 0.0517883 ms, percentile(99%) = 0.0525513 ms
[12/18/2024-10:43:23] [I] Total Host Walltime: 3.01719 s
[12/18/2024-10:43:23] [I] Total GPU Compute Time: 3.00863 s
[12/18/2024-10:43:23] [I] Explanations of the performance metrics are printed in the verbose logs.
[12/18/2024-10:43:23] [I] 
&&&& PASSED TensorRT.trtexec [TensorRT v8502] # /usr/src/tensorrt/bin/trtexec --loadEngine=./plate_rec_color_small.onnx_b12_gpu0_dla0_fp16.engine

Not using DLA:

/usr/src/tensorrt/bin/trtexec --loadEngine=./plate_rec_color_small.onnx_b12_gpu0_fp16.engine   
&&&& RUNNING TensorRT.trtexec [TensorRT v8502] # /usr/src/tensorrt/bin/trtexec --loadEngine=./plate_rec_color_small.onnx_b12_gpu0_fp16.engine
[12/18/2024-10:43:27] [I] DLACore: 
[12/18/2024-10:43:27] [I] Plugins:
[12/18/2024-10:43:27] [I] === Inference Options ===
[12/18/2024-10:43:27] [I] Batch: 1
[12/18/2024-10:43:27] [I] Input inference shapes: model
[12/18/2024-10:43:27] [I] Iterations: 10
[12/18/2024-10:43:27] [I] Duration: 3s (+ 200ms warm up)
[12/18/2024-10:43:27] [I] Sleep time: 0ms
[12/18/2024-10:43:27] [I] Idle time: 0ms
[12/18/2024-10:43:27] [I] Streams: 1
[12/18/2024-10:43:27] [I] ExposeDMA: Disabled
[12/18/2024-10:43:27] [I] Data transfers: Enabled
[12/18/2024-10:43:27] [I] Spin-wait: Disabled
[12/18/2024-10:43:27] [I] Multithreading: Disabled
[12/18/2024-10:43:27] [I] CUDA Graph: Disabled
[12/18/2024-10:43:27] [I] Separate profiling: Disabled
[12/18/2024-10:43:27] [I] Time Deserialize: Disabled
[12/18/2024-10:43:27] [I] Time Refit: Disabled
[12/18/2024-10:43:27] [I] NVTX verbosity: 0
[12/18/2024-10:43:27] [I] Persistent Cache Ratio: 0
[12/18/2024-10:43:27] [I] Inputs:
[12/18/2024-10:43:27] [I] === Reporting Options ===
[12/18/2024-10:43:27] [I] Verbose: Disabled
[12/18/2024-10:43:27] [I] Averages: 10 inferences
[12/18/2024-10:43:27] [I] Percentiles: 90,95,99
[12/18/2024-10:43:27] [I] Dump refittable layers:Disabled
[12/18/2024-10:43:27] [I] Dump output: Disabled
[12/18/2024-10:43:27] [I] Profile: Disabled
[12/18/2024-10:43:27] [I] Export timing to JSON file: 
[12/18/2024-10:43:27] [I] Export output to JSON file: 
[12/18/2024-10:43:27] [I] Export profile to JSON file: 
[12/18/2024-10:43:27] [I] 
[12/18/2024-10:43:27] [I] === Device Information ===
[12/18/2024-10:43:27] [I] Selected Device: Xavier
[12/18/2024-10:43:27] [I] Compute Capability: 7.2
[12/18/2024-10:43:27] [I] SMs: 8
[12/18/2024-10:43:27] [I] Compute Clock Rate: 1.377 GHz
[12/18/2024-10:43:27] [I] Device Global Memory: 30990 MiB
[12/18/2024-10:43:27] [I] Shared Memory per SM: 96 KiB
[12/18/2024-10:43:27] [I] Memory Bus Width: 256 bits (ECC disabled)
[12/18/2024-10:43:27] [I] Memory Clock Rate: 1.377 GHz
[12/18/2024-10:43:27] [I] 
[12/18/2024-10:43:27] [I] TensorRT version: 8.5.2
[12/18/2024-10:43:27] [I] Engine loaded in 0.00244214 sec.
[12/18/2024-10:43:27] [I] [TRT] Loaded engine size: 0 MiB
[12/18/2024-10:43:28] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +0, now: CPU 0, GPU 0 (MiB)
[12/18/2024-10:43:28] [I] Engine deserialized in 0.947943 sec.
[12/18/2024-10:43:28] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +5, now: CPU 0, GPU 5 (MiB)
[12/18/2024-10:43:28] [I] Setting persistentCacheLimit to 0 bytes.
[12/18/2024-10:43:28] [W] Shape missing for input with dynamic shape: imagesAutomatically setting shape to: 1x3x48x168
[12/18/2024-10:43:28] [I] Using random values for input images
[12/18/2024-10:43:28] [I] Created input binding for images with dimensions 1x3x48x168
[12/18/2024-10:43:28] [I] Using random values for output output_2
[12/18/2024-10:43:28] [I] Created output binding for output_2 with dimensions 1x5
[12/18/2024-10:43:28] [I] Using random values for output output_1
[12/18/2024-10:43:28] [I] Created output binding for output_1 with dimensions 1x21x78
[12/18/2024-10:43:28] [I] Starting inference
[12/18/2024-10:43:31] [I] Warmup completed 146 queries over 200 ms
[12/18/2024-10:43:31] [I] Timing trace has 9516 queries over 3.00131 s
[12/18/2024-10:43:31] [I] 
[12/18/2024-10:43:31] [I] === Trace details ===
[12/18/2024-10:43:31] [I] Trace averages of 10 runs:
[12/18/2024-10:43:31] [I] Average on 10 runs - GPU latency: 0.314313 ms - Host latency: 0.328241 ms (enqueue 0.192389 ms)
[12/18/2024-10:43:31] [I] Average on 10 runs - GPU latency: 0.314954 ms - Host latency: 0.330383 ms (enqueue 0.196613 ms)
[12/18/2024-10:43:31] [I] Average on 10 runs - GPU latency: 0.313416 ms - Host latency: 0.326752 ms (enqueue 0.191602 ms)
[12/18/2024-10:43:31] [I] Average on 10 runs - GPU latency: 0.31377 ms - Host latency: 0.326819 ms (enqueue 0.191174 ms)
[12/18/2024-10:43:31] [I] Average on 10 runs - GPU latency: 0.313239 ms - Host latency: 0.327753 ms (enqueue 0.19267 ms)
[12/18/2024-10:43:31] [I] Average on 10 runs - GPU latency: 0.31192 ms - Host latency: 0.325085 ms (enqueue 0.200092 ms)
[12/18/2024-10:43:31] [I] Average on 10 runs - GPU latency: 0.312909 ms - Host latency: 0.325946 ms (enqueue 0.196893 ms)
[12/18/2024-10:43:31] [I] Average on 10 runs - GPU latency: 0.312524 ms - Host latency: 0.325525 ms (enqueue 0.191577 ms)
[12/18/2024-10:43:31] [I] Average on 10 runs - GPU latency: 0.312085 ms - Host latency: 0.325183 ms (enqueue 0.196155 ms)
[12/18/2024-10:43:31] [I] Average on 10 runs - GPU latency: 0.313843 ms - Host latency: 0.327246 ms (enqueue 0.194348 ms)
[12/18/2024-10:43:31] [I] Average on 10 runs - GPU latency: 0.312634 ms - Host latency: 0.326648 ms (enqueue 0.193518 ms)
[12/18/2024-10:43:31] [I] Average on 10 runs - GPU latency: 0.313354 ms - Host latency: 0.326318 ms (enqueue 0.193164 ms)
[12/18/2024-10:43:31] [I] 
[12/18/2024-10:43:31] [I] === Performance summary ===
[12/18/2024-10:43:31] [I] Throughput: 3170.61 qps
[12/18/2024-10:43:31] [I] Latency: min = 0.319092 ms, max = 0.653885 ms, mean = 0.328242 ms, median = 0.325928 ms, percentile(90%) = 0.3302 ms, percentile(95%) = 0.34082 ms, percentile(99%) = 0.359589 ms
[12/18/2024-10:43:31] [I] Enqueue Time: min = 0.179443 ms, max = 0.385986 ms, mean = 0.192961 ms, median = 0.188828 ms, percentile(90%) = 0.203705 ms, percentile(95%) = 0.2229 ms, percentile(99%) = 0.245117 ms
[12/18/2024-10:43:31] [I] H2D Latency: min = 0.00634766 ms, max = 0.0775757 ms, mean = 0.00997222 ms, median = 0.00952148 ms, percentile(90%) = 0.010376 ms, percentile(95%) = 0.0112305 ms, percentile(99%) = 0.026123 ms
[12/18/2024-10:43:31] [I] GPU Compute Time: min = 0.306396 ms, max = 0.634872 ms, mean = 0.314492 ms, median = 0.312622 ms, percentile(90%) = 0.315918 ms, percentile(95%) = 0.317505 ms, percentile(99%) = 0.345062 ms
[12/18/2024-10:43:31] [I] D2H Latency: min = 0.00219727 ms, max = 0.00866699 ms, mean = 0.00377876 ms, median = 0.00384521 ms, percentile(90%) = 0.00463867 ms, percentile(95%) = 0.00476074 ms, percentile(99%) = 0.00512695 ms
[12/18/2024-10:43:31] [I] Total Host Walltime: 3.00131 s
[12/18/2024-10:43:31] [I] Total GPU Compute Time: 2.9927 s
[12/18/2024-10:43:31] [W] * GPU compute time is unstable, with coefficient of variance = 4.96875%.
[12/18/2024-10:43:31] [W]   If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[12/18/2024-10:43:31] [I] Explanations of the performance metrics are printed in the verbose logs.
[12/18/2024-10:43:31] [I] 
&&&& PASSED TensorRT.trtexec [TensorRT v8502] # /usr/src/tensorrt/bin/trtexec --loadEngine=./plate_rec_color_small.onnx_b12_gpu0_fp16.engine

Why does the DLA not help? How should I deal with this?

I changed ‘GlobalAveragePool’ to ‘AveragePool’ because ‘AveragePool’ is supported by the DLA, but I found that the throughput actually decreased.
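For reference, replacing GlobalAveragePool with an AveragePool whose kernel covers the whole spatial extent is numerically equivalent, so the swap itself should not change the model's outputs. A minimal pure-Python check of that equivalence (using a small hypothetical 2x3 feature map, not the real model's tensors):

```python
# GlobalAveragePool over an HxW feature map equals AveragePool with
# kernel_shape=(H, W) and stride 1: both reduce the map to its mean.
# Small hypothetical 2x3 map for illustration.
fmap = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
]
H, W = len(fmap), len(fmap[0])

# GlobalAveragePool: mean over all spatial positions.
global_avg = sum(sum(row) for row in fmap) / (H * W)

# AveragePool with a full-size kernel: one window covering the whole map.
window = [fmap[i][j] for i in range(H) for j in range(W)]
full_kernel_avg = sum(window) / len(window)

assert global_avg == full_kernel_avg
print(global_avg)  # 3.5
```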

[12/18/2024-11:19:45] [I] === Performance summary ===
[12/18/2024-11:19:45] [I] Throughput: 154.209 qps
[12/18/2024-11:19:45] [I] Latency: min = 6.1745 ms, max = 7.21765 ms, mean = 6.92169 ms, median = 7.12988 ms, percentile(90%) = 7.15894 ms, percentile(95%) = 7.16577 ms, percentile(99%) = 7.18213 ms
[12/18/2024-11:19:45] [I] Enqueue Time: min = 0.219238 ms, max = 0.756348 ms, mean = 0.347025 ms, median = 0.303955 ms, percentile(90%) = 0.504639 ms, percentile(95%) = 0.540527 ms, percentile(99%) = 0.605011 ms
[12/18/2024-11:19:45] [I] H2D Latency: min = 0.394409 ms, max = 0.432129 ms, mean = 0.407958 ms, median = 0.405029 ms, percentile(90%) = 0.420532 ms, percentile(95%) = 0.42218 ms, percentile(99%) = 0.4245 ms
[12/18/2024-11:19:45] [I] GPU Compute Time: min = 5.72894 ms, max = 6.75439 ms, mean = 6.4672 ms, median = 6.67273 ms, percentile(90%) = 6.69897 ms, percentile(95%) = 6.70667 ms, percentile(99%) = 6.72021 ms
[12/18/2024-11:19:45] [I] D2H Latency: min = 0.0395508 ms, max = 0.0570068 ms, mean = 0.0465328 ms, median = 0.0446777 ms, percentile(90%) = 0.0522461 ms, percentile(95%) = 0.0541992 ms, percentile(99%) = 0.0563965 ms
[12/18/2024-11:19:45] [I] Total Host Walltime: 3.02188 s
[12/18/2024-11:19:45] [I] Total GPU Compute Time: 3.01372 s
[12/18/2024-11:19:45] [W] * GPU compute time is unstable, with coefficient of variance = 6.03109%.
[12/18/2024-11:19:45] [W]   If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[12/18/2024-11:19:45] [I] Explanations of the performance metrics are printed in the verbose logs.
[12/18/2024-11:19:45] [I] 
&&&& PASSED TensorRT.trtexec [TensorRT v8502] # /usr/src/tensorrt/bin/trtexec --loadEngine=./plate_rec_color_small_dla.onnx_b12_gpu0_dla0_fp16.engine

Many nodes and layers in this model are not supported by the DLA, so those layers fall back to the GPU. Mixing DLA and GPU layers within one model causes extra data transfers and switching between the two devices, so performance will not improve this way.
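Also note that the two trtexec runs above are not directly comparable in qps: the DLA engine ran with a fixed batch of 12 (input binding 12x3x48x168), while the GPU-only engine has a dynamic input shape that trtexec auto-set to batch 1 (see the "Shape missing for input" warning). A rough per-image sanity check, using the throughput figures from the two logs:

```python
# Convert trtexec qps to images/s, since the two runs used different
# batch sizes: the DLA engine ran with a fixed batch of 12, while the
# GPU-only engine's dynamic input shape was auto-set to batch 1.
dla_qps, dla_batch = 166.049, 12    # from the DLA run above
gpu_qps, gpu_batch = 3170.61, 1     # from the GPU-only run above

dla_imgs_per_s = dla_qps * dla_batch
gpu_imgs_per_s = gpu_qps * gpu_batch

print(f"DLA engine: {dla_imgs_per_s:.0f} images/s")
print(f"GPU engine: {gpu_imgs_per_s:.0f} images/s")
```

Even after normalizing for batch size, the DLA engine is still well behind the GPU-only engine here, consistent with the fallback overhead described above.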

For the list of DLA-supported operators, you may refer to Deep-Learning-Accelerator-SW/operators/README.md at main · NVIDIA/Deep-Learning-Accelerator-SW

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.