Hello, I'm one of the team members. This problem occurs in ROS: when we use ROS to publish images and run inference on the DLA, we hit the error "[TRT] [E] 1: [cudlaUtils.cpp::submit::95] Error Code 1: DLA (Failed to submit program to DLA engine.)". Our Python code may also contain a mistake — we use the DLA through the TensorRT Python API. Please check the file named ‘infer_ros.py’ in the attached archive and help us resolve it. ros_dla_test.zip (59.5 MB)
Yes, I deployed the model on the DLA via trtexec and got correct results, so I think the problem is in my Python API usage. Please check ‘infer_ros.py’. This is trtexec’s log:
&&& RUNNING TensorRT.trtexec [TensorRT v8502] # /usr/src/tensorrt/bin/trtexec --onnx=model_best.onnx --useDLACore=0 --allowGPUFallback
[10/14/2024-14:50:48] [I] === Model Options ===
[10/14/2024-14:50:48] [I] Format: ONNX
[10/14/2024-14:50:48] [I] Model: model_best.onnx
[10/14/2024-14:50:48] [I] Output:
[10/14/2024-14:50:48] [I] === Build Options ===
[10/14/2024-14:50:48] [I] Max batch: explicit batch
[10/14/2024-14:50:48] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[10/14/2024-14:50:48] [I] minTiming: 1
[10/14/2024-14:50:48] [I] avgTiming: 8
[10/14/2024-14:50:48] [I] Precision: FP32
[10/14/2024-14:50:48] [I] LayerPrecisions:
[10/14/2024-14:50:48] [I] Calibration:
[10/14/2024-14:50:48] [I] Refit: Disabled
[10/14/2024-14:50:48] [I] Sparsity: Disabled
[10/14/2024-14:50:48] [I] Safe mode: Disabled
[10/14/2024-14:50:48] [I] DirectIO mode: Disabled
[10/14/2024-14:50:48] [I] Restricted mode: Disabled
[10/14/2024-14:50:48] [I] Build only: Disabled
[10/14/2024-14:50:48] [I] Save engine:
[10/14/2024-14:50:48] [I] Load engine:
[10/14/2024-14:50:48] [I] Profiling verbosity: 0
[10/14/2024-14:50:48] [I] Tactic sources: Using default tactic sources
[10/14/2024-14:50:48] [I] timingCacheMode: local
[10/14/2024-14:50:48] [I] timingCacheFile:
[10/14/2024-14:50:48] [I] Heuristic: Disabled
[10/14/2024-14:50:48] [I] Preview Features: Use default preview flags.
[10/14/2024-14:50:48] [I] Input(s)s format: fp32:CHW
[10/14/2024-14:50:48] [I] Output(s)s format: fp32:CHW
[10/14/2024-14:50:48] [I] Input build shapes: model
[10/14/2024-14:50:48] [I] Input calibration shapes: model
[10/14/2024-14:50:48] [I] === System Options ===
[10/14/2024-14:50:48] [I] Device: 0
[10/14/2024-14:50:48] [I] DLACore: 0(With GPU fallback)
[10/14/2024-14:50:48] [I] Plugins:
[10/14/2024-14:50:48] [I] === Inference Options ===
[10/14/2024-14:50:48] [I] Batch: Explicit
[10/14/2024-14:50:48] [I] Input inference shapes: model
[10/14/2024-14:50:48] [I] Iterations: 10
[10/14/2024-14:50:48] [I] Duration: 3s (+ 200ms warm up)
[10/14/2024-14:50:48] [I] Sleep time: 0ms
[10/14/2024-14:50:48] [I] Idle time: 0ms
[10/14/2024-14:50:48] [I] Streams: 1
[10/14/2024-14:50:48] [I] ExposeDMA: Disabled
[10/14/2024-14:50:48] [I] Data transfers: Enabled
[10/14/2024-14:50:48] [I] Spin-wait: Disabled
[10/14/2024-14:50:48] [I] Multithreading: Disabled
[10/14/2024-14:50:48] [I] CUDA Graph: Disabled
[10/14/2024-14:50:48] [I] Separate profiling: Disabled
[10/14/2024-14:50:48] [I] Time Deserialize: Disabled
[10/14/2024-14:50:48] [I] Time Refit: Disabled
[10/14/2024-14:50:48] [I] NVTX verbosity: 0
[10/14/2024-14:50:48] [I] Persistent Cache Ratio: 0
[10/14/2024-14:50:48] [I] Inputs:
[10/14/2024-14:50:48] [I] === Reporting Options ===
[10/14/2024-14:50:48] [I] Verbose: Disabled
[10/14/2024-14:50:48] [I] Averages: 10 inferences
[10/14/2024-14:50:48] [I] Percentiles: 90,95,99
[10/14/2024-14:50:48] [I] Dump refittable layers:Disabled
[10/14/2024-14:50:48] [I] Dump output: Disabled
[10/14/2024-14:50:48] [I] Profile: Disabled
[10/14/2024-14:50:48] [I] Export timing to JSON file:
[10/14/2024-14:50:48] [I] Export output to JSON file:
[10/14/2024-14:50:48] [I] Export profile to JSON file:
[10/14/2024-14:50:48] [I]
[10/14/2024-14:50:48] [I] === Device Information ===
[10/14/2024-14:50:48] [I] Selected Device: Orin
[10/14/2024-14:50:48] [I] Compute Capability: 8.7
[10/14/2024-14:50:48] [I] SMs: 8
[10/14/2024-14:50:48] [I] Compute Clock Rate: 0.918 GHz
[10/14/2024-14:50:48] [I] Device Global Memory: 15523 MiB
[10/14/2024-14:50:48] [I] Shared Memory per SM: 164 KiB
[10/14/2024-14:50:48] [I] Memory Bus Width: 256 bits (ECC disabled)
[10/14/2024-14:50:48] [I] Memory Clock Rate: 0.918 GHz
[10/14/2024-14:50:48] [I]
[10/14/2024-14:50:48] [I] TensorRT version: 8.5.2
[10/14/2024-14:50:49] [I] [TRT] [MemUsageChange] Init CUDA: CPU +220, GPU +0, now: CPU 249, GPU 4720 (MiB)
[10/14/2024-14:50:51] [I] [TRT] [MemUsageChange] Init builder kernel library: CPU +302, GPU +285, now: CPU 574, GPU 5026 (MiB)
[10/14/2024-14:50:51] [I] Start parsing network model
[10/14/2024-14:50:51] [I] [TRT] ----------------------------------------------------------------
[10/14/2024-14:50:51] [I] [TRT] Input filename: model_best.onnx
[10/14/2024-14:50:51] [I] [TRT] ONNX IR version: 0.0.6
[10/14/2024-14:50:51] [I] [TRT] Opset version: 11
[10/14/2024-14:50:51] [I] [TRT] Producer name: pytorch
[10/14/2024-14:50:51] [I] [TRT] Producer version: 2.1.0
[10/14/2024-14:50:51] [I] [TRT] Domain:
[10/14/2024-14:50:51] [I] [TRT] Model version: 0
[10/14/2024-14:50:51] [I] [TRT] Doc string:
[10/14/2024-14:50:51] [I] [TRT] ----------------------------------------------------------------
[10/14/2024-14:50:51] [I] Finish parsing network model
[10/14/2024-14:50:58] [I] [TRT] ---------- Layers Running on DLA ----------
[10/14/2024-14:50:58] [I] [TRT] [DlaLayer] {ForeignNode[/conv1/Conv…/hp_offset/hp_offset.2/Conv]}
[10/14/2024-14:50:58] [I] [TRT] ---------- Layers Running on GPU ----------
[10/14/2024-14:50:59] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +534, GPU +349, now: CPU 1173, GPU 5601 (MiB)
[10/14/2024-14:51:00] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +82, GPU +76, now: CPU 1255, GPU 5677 (MiB)
[10/14/2024-14:51:00] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored.
[10/14/2024-14:51:53] [I] [TRT] Total Activation Memory: 16298803200
[10/14/2024-14:51:53] [I] [TRT] Detected 1 inputs and 9 output network tensors.
[10/14/2024-14:51:54] [I] [TRT] Total Host Persistent Memory: 160
[10/14/2024-14:51:54] [I] [TRT] Total Device Persistent Memory: 0
[10/14/2024-14:51:54] [I] [TRT] Total Scratch Memory: 0
[10/14/2024-14:51:54] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 34 MiB, GPU 87 MiB
[10/14/2024-14:51:54] [I] [TRT] [BlockAssignment] Started assigning block shifts. This will take 10 steps to complete.
[10/14/2024-14:51:54] [I] [TRT] [BlockAssignment] Algorithm ShiftNTopDown took 0.135043ms to assign 10 blocks to 10 nodes requiring 21602304 bytes.
[10/14/2024-14:51:54] [I] [TRT] Total Activation Memory: 21602304
[10/14/2024-14:51:54] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +34, GPU +0, now: CPU 34, GPU 0 (MiB)
[10/14/2024-14:51:54] [I] Engine built in 65.5897 sec.
[10/14/2024-14:51:54] [I] [TRT] Loaded engine size: 34 MiB
[10/14/2024-14:51:54] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +34, GPU +0, now: CPU 34, GPU 0 (MiB)
[10/14/2024-14:51:54] [I] Engine deserialized in 0.00830032 sec.
[10/14/2024-14:51:54] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +20, now: CPU 34, GPU 20 (MiB)
[10/14/2024-14:51:54] [I] Setting persistentCacheLimit to 0 bytes.
[10/14/2024-14:51:54] [I] Using random values for input input.1
[10/14/2024-14:51:54] [I] Created input binding for input.1 with dimensions 3x3x576x960
[10/14/2024-14:51:54] [I] Using random values for output 252
[10/14/2024-14:51:54] [I] Created output binding for 252 with dimensions 3x3x158x254
[10/14/2024-14:51:54] [I] Using random values for output 255
[10/14/2024-14:51:54] [I] Created output binding for 255 with dimensions 3x2x158x254
[10/14/2024-14:51:54] [I] Using random values for output 258
[10/14/2024-14:51:54] [I] Created output binding for 258 with dimensions 3x18x158x254
[10/14/2024-14:51:54] [I] Using random values for output 261
[10/14/2024-14:51:54] [I] Created output binding for 261 with dimensions 3x8x158x254
[10/14/2024-14:51:54] [I] Using random values for output 264
[10/14/2024-14:51:54] [I] Created output binding for 264 with dimensions 3x3x158x254
[10/14/2024-14:51:54] [I] Using random values for output 267
[10/14/2024-14:51:54] [I] Created output binding for 267 with dimensions 3x1x158x254
[10/14/2024-14:51:54] [I] Using random values for output 270
[10/14/2024-14:51:54] [I] Created output binding for 270 with dimensions 3x2x158x254
[10/14/2024-14:51:54] [I] Using random values for output 273
[10/14/2024-14:51:54] [I] Created output binding for 273 with dimensions 3x9x158x254
[10/14/2024-14:51:54] [I] Using random values for output 276
[10/14/2024-14:51:54] [I] Created output binding for 276 with dimensions 3x2x158x254
[10/14/2024-14:51:54] [I] Starting inference
[10/14/2024-14:52:04] [I] Warmup completed 1 queries over 200 ms
[10/14/2024-14:52:04] [I] Timing trace has 10 queries over 10.4786 s
[10/14/2024-14:52:04] [I]
[10/14/2024-14:52:04] [I] === Trace details ===
[10/14/2024-14:52:04] [I] Trace averages of 10 runs:
[10/14/2024-14:52:04] [I] Average on 10 runs - GPU latency: 948.372 ms - Host latency: 954.412 ms (enqueue 0.48383 ms)
[10/14/2024-14:52:04] [I]
[10/14/2024-14:52:04] [I] === Performance summary ===
[10/14/2024-14:52:04] [I] Throughput: 0.954329 qps
[10/14/2024-14:52:04] [I] Latency: min = 953.681 ms, max = 954.573 ms, mean = 954.411 ms, median = 954.483 ms, percentile(90%) = 954.543 ms, percentile(95%) = 954.573 ms, percentile(99%) = 954.573 ms
[10/14/2024-14:52:04] [I] Enqueue Time: min = 0.291911 ms, max = 0.761475 ms, mean = 0.48383 ms, median = 0.471924 ms, percentile(90%) = 0.537598 ms, percentile(95%) = 0.761475 ms, percentile(99%) = 0.761475 ms
[10/14/2024-14:52:04] [I] H2D Latency: min = 2.53674 ms, max = 2.6488 ms, mean = 2.58658 ms, median = 2.58496 ms, percentile(90%) = 2.59375 ms, percentile(95%) = 2.6488 ms, percentile(99%) = 2.6488 ms
[10/14/2024-14:52:04] [I] GPU Compute Time: min = 948.333 ms, max = 948.418 ms, mean = 948.372 ms, median = 948.364 ms, percentile(90%) = 948.399 ms, percentile(95%) = 948.418 ms, percentile(99%) = 948.418 ms
[10/14/2024-14:52:04] [I] D2H Latency: min = 2.7334 ms, max = 3.54102 ms, mean = 3.45289 ms, median = 3.53223 ms, percentile(90%) = 3.54053 ms, percentile(95%) = 3.54102 ms, percentile(99%) = 3.54102 ms
[10/14/2024-14:52:04] [I] Total Host Walltime: 10.4786 s
[10/14/2024-14:52:04] [I] Total GPU Compute Time: 9.48372 s
[10/14/2024-14:52:04] [I] Explanations of the performance metrics are printed in the verbose logs.
[10/14/2024-14:52:04] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8502] # /usr/src/tensorrt/bin/trtexec --onnx=model_best.onnx --useDLACore=0 --allowGPUFallback
The problem seems to be solved, but there is a new one: the engine was quantized/built for DLA1, yet inference executes on DLA0. Why does this happen? What DLA-related parameters need to be set at inference time?
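As a follow-up: from what I understand of the TensorRT 8.5 Python API, the DLA core used at inference is selected on the `Runtime` object before the engine is deserialized; it is not baked into the serialized engine. A minimal sketch of how I believe `infer_ros.py` could pin inference to DLA1 (the engine filename and variable names here are placeholders, not from our actual code):

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.INFO)

def load_engine_on_dla(engine_path: str, dla_core: int):
    """Deserialize a DLA engine pinned to a specific DLA core.

    DLA_core must be set on the Runtime *before*
    deserialize_cuda_engine() is called; setting it afterwards
    does not affect an already-created engine.
    """
    runtime = trt.Runtime(TRT_LOGGER)
    runtime.DLA_core = dla_core  # e.g. 1 to run on DLA1 instead of the default DLA0
    with open(engine_path, "rb") as f:
        engine = runtime.deserialize_cuda_engine(f.read())
    return engine

# Placeholder path: point this at the engine actually built for the DLA
engine = load_engine_on_dla("model_best.engine", dla_core=1)
context = engine.create_execution_context()
```

For reference, when running a prebuilt engine with trtexec, the core is selected the same way: `trtexec --loadEngine=model_best.engine --useDLACore=1`.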