When we use the DLA on Orin NX, model inference may fail after running for a period of time with the following error: [cudlaUtils.cpp::submit::95] Error Code 1: DLA (Failed to submit program to DLA engine.) Thank you very much!
Dear @user54829,
Are you using trtexec? If so, could you share the command and model you used as well?
This is the command and model:
/usr/src/tensorrt/bin/trtexec --onnx=./model_best.onnx --saveEngine=model_best_int8.trt8 --int8 --workspace=4096 --useDLACore=0
model_best.zip (59.2 MB)
Dear @user54829,
Just want to confirm: have you tested with the latest release?
No, I’m still using the version I sent you before. Could you help me solve this problem?
I have tested with the command /usr/src/tensorrt/bin/trtexec --onnx=/home/nvidia/model_best.onnx --saveEngine=/home/nvidia/model_best_int8.trt8 --int8 --workspace=4096 --useDLACore=0 --inputIOFormats=int8:dla_hwc4 --outputIOFormats=int8:chw32 --verbose
on JetPack 6.0 and it seems to be working:
[07/30/2024-06:43:04] [I] === Performance summary ===
[07/30/2024-06:43:04] [I] Throughput: 13.7734 qps
[07/30/2024-06:43:04] [I] Latency: min = 75.7773 ms, max = 76.2686 ms, mean = 75.9188 ms, median = 75.8937 ms, percentile(90%) = 76.0683 ms, percentile(95%) = 76.129 ms, percentile(99%) = 76.2686 ms
[07/30/2024-06:43:04] [I] Enqueue Time: min = 0.182007 ms, max = 0.6315 ms, mean = 0.34453 ms, median = 0.349854 ms, percentile(90%) = 0.413147 ms, percentile(95%) = 0.432831 ms, percentile(99%) = 0.6315 ms
[07/30/2024-06:43:04] [I] H2D Latency: min = 0.781494 ms, max = 0.88147 ms, mean = 0.801546 ms, median = 0.796509 ms, percentile(90%) = 0.821777 ms, percentile(95%) = 0.831207 ms, percentile(99%) = 0.88147 ms
[07/30/2024-06:43:04] [I] GPU Compute Time: min = 70.892 ms, max = 71.4028 ms, mean = 71.025 ms, median = 71.0055 ms, percentile(90%) = 71.1526 ms, percentile(95%) = 71.2463 ms, percentile(99%) = 71.4028 ms
[07/30/2024-06:43:04] [I] D2H Latency: min = 4.0813 ms, max = 4.1084 ms, mean = 4.09228 ms, median = 4.0896 ms, percentile(90%) = 4.10669 ms, percentile(95%) = 4.10718 ms, percentile(99%) = 4.1084 ms
[07/30/2024-06:43:04] [I] Total Host Walltime: 3.26717 s
[07/30/2024-06:43:04] [I] Total GPU Compute Time: 3.19612 s
[07/30/2024-06:43:04] [I] Explanations of the performance metrics are printed in the verbose logs.
[07/30/2024-06:43:04] [V]
[07/30/2024-06:43:04] [V] === Explanations of the performance metrics ===
[07/30/2024-06:43:04] [V] Total Host Walltime: the host walltime from when the first query (after warmups) is enqueued to when the last query is completed.
[07/30/2024-06:43:04] [V] GPU Compute Time: the GPU latency to execute the kernels for a query.
[07/30/2024-06:43:04] [V] Total GPU Compute Time: the summation of the GPU Compute Time of all the queries. If this is significantly shorter than Total Host Walltime, the GPU may be under-utilized because of host-side overheads or data transfers.
[07/30/2024-06:43:04] [V] Throughput: the observed throughput computed by dividing the number of queries by the Total Host Walltime. If this is significantly lower than the reciprocal of GPU Compute Time, the GPU may be under-utilized because of host-side overheads or data transfers.
[07/30/2024-06:43:04] [V] Enqueue Time: the host latency to enqueue a query. If this is longer than GPU Compute Time, the GPU may be under-utilized.
[07/30/2024-06:43:04] [V] H2D Latency: the latency for host-to-device data transfers for input tensors of a single query.
[07/30/2024-06:43:04] [V] D2H Latency: the latency for device-to-host data transfers for output tensors of a single query.
[07/30/2024-06:43:04] [V] Latency: the summation of H2D Latency, GPU Compute Time, and D2H Latency. This is the latency to infer a single query.
[07/30/2024-06:43:04] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8602] # /usr/src/tensorrt/bin/trtexec --onnx=/home/nvidia/model_best.onnx --saveEngine=/home/nvidia/model_best_int8.trt8 --int8 --workspace=4096 --useDLACore=0 --inputIOFormats=int8:dla_hwc4 --outputIOFormats=int8:chw32 --verbose
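As a side note for anyone reading the summary above: the metrics hang together exactly as the verbose explanations describe (Latency = H2D Latency + GPU Compute Time + D2H Latency, Throughput = number of queries / Total Host Walltime). A quick sanity check using the reported mean values:

```python
# Sanity-check the relationships trtexec reports among its metrics,
# using the mean values from the performance summary above.
h2d = 0.801546      # H2D Latency mean (ms)
gpu = 71.025        # GPU Compute Time mean (ms)
d2h = 4.09228       # D2H Latency mean (ms)
latency = 75.9188   # reported Latency mean (ms)

# Latency should be the sum of H2D transfer, GPU compute, and D2H transfer.
assert abs((h2d + gpu + d2h) - latency) < 1e-3

walltime = 3.26717  # Total Host Walltime (s)
qps = 13.7734       # reported Throughput (qps)

# Throughput is queries / walltime, so the timed query count falls out:
queries = round(qps * walltime)
print(queries)  # 45 timed queries in this run
assert abs(queries / walltime - qps) < 1e-3
```

The ~71 ms mean GPU Compute Time dwarfs the sub-millisecond Enqueue Time, so the DLA run is compute-bound rather than limited by host-side overhead.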
nvidia@tegra-ubuntu:~$ head -1 /etc/nv_tegra_release
# R36 (release), REVISION: 3.0, GCID: 36191598, BOARD: generic, EABI: aarch64, DATE: Mon May 6 17:34:21 UTC 2024
ros_dla_test.zip (59.5 MB)
Hello, I have tested with your command. The model can be converted to INT8 on DLA. But when I use model_best.trt for inference, the problem arises again. I have uploaded the inference code with two approaches; the ONNX model and data are all in the compressed archive. Thank you very much!
Hello, have you tested it with the demo we provided? Is this issue still being followed up? Looking forward to your reply!
No issue with trtexec when loading the model:
nvidia@tegra-ubuntu:~$ /usr/src/tensorrt/bin/trtexec --loadEngine=model_best_int8.trt8 --verbose --int8
&&&& RUNNING TensorRT.trtexec [TensorRT v8602] # /usr/src/tensorrt/bin/trtexec --loadEngine=model_best_int8.trt8 --verbose --int8
[08/14/2024-01:55:03] [I] === Model Options ===
[08/14/2024-01:55:03] [I] Format: *
[08/14/2024-01:55:03] [I] Model:
[08/14/2024-01:55:03] [I] Output:
[08/14/2024-01:55:03] [I] === Build Options ===
[08/14/2024-01:55:03] [I] Max batch: 1
[08/14/2024-01:55:03] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[08/14/2024-01:55:03] [I] minTiming: 1
[08/14/2024-01:55:03] [I] avgTiming: 8
[08/14/2024-01:55:03] [I] Precision: FP32+INT8
[08/14/2024-01:55:03] [I] LayerPrecisions:
[08/14/2024-01:55:03] [I] Layer Device Types:
[08/14/2024-01:55:03] [I] Calibration: Dynamic
[08/14/2024-01:55:03] [I] Refit: Disabled
[08/14/2024-01:55:03] [I] Version Compatible: Disabled
[08/14/2024-01:55:03] [I] ONNX Native InstanceNorm: Disabled
[08/14/2024-01:55:03] [I] TensorRT runtime: full
[08/14/2024-01:55:03] [I] Lean DLL Path:
[08/14/2024-01:55:03] [I] Tempfile Controls: { in_memory: allow, temporary: allow }
[08/14/2024-01:55:03] [I] Exclude Lean Runtime: Disabled
[08/14/2024-01:55:03] [I] Sparsity: Disabled
[08/14/2024-01:55:03] [I] Safe mode: Disabled
[08/14/2024-01:55:03] [I] Build DLA standalone loadable: Disabled
[08/14/2024-01:55:03] [I] Allow GPU fallback for DLA: Disabled
[08/14/2024-01:55:03] [I] DirectIO mode: Disabled
[08/14/2024-01:55:03] [I] Restricted mode: Disabled
[08/14/2024-01:55:03] [I] Skip inference: Disabled
[08/14/2024-01:55:03] [I] Save engine:
[08/14/2024-01:55:03] [I] Load engine: model_best_int8.trt8
[08/14/2024-01:55:03] [I] Profiling verbosity: 0
[08/14/2024-01:55:03] [I] Tactic sources: Using default tactic sources
[08/14/2024-01:55:03] [I] timingCacheMode: local
[08/14/2024-01:55:03] [I] timingCacheFile:
[08/14/2024-01:55:03] [I] Heuristic: Disabled
[08/14/2024-01:55:03] [I] Preview Features: Use default preview flags.
[08/14/2024-01:55:03] [I] MaxAuxStreams: -1
[08/14/2024-01:55:03] [I] BuilderOptimizationLevel: -1
[08/14/2024-01:55:03] [I] Input(s)s format: fp32:CHW
[08/14/2024-01:55:03] [I] Output(s)s format: fp32:CHW
[08/14/2024-01:55:03] [I] Input build shapes: model
[08/14/2024-01:55:03] [I] Input calibration shapes: model
[08/14/2024-01:55:03] [I] === System Options ===
[08/14/2024-01:55:03] [I] Device: 0
[08/14/2024-01:55:03] [I] DLACore:
[08/14/2024-01:55:03] [I] Plugins:
[08/14/2024-01:55:03] [I] setPluginsToSerialize:
[08/14/2024-01:55:03] [I] dynamicPlugins:
[08/14/2024-01:55:03] [I] ignoreParsedPluginLibs: 0
[08/14/2024-01:55:03] [I]
[08/14/2024-01:55:03] [I] === Inference Options ===
[08/14/2024-01:55:03] [I] Batch: 1
[08/14/2024-01:55:03] [I] Input inference shapes: model
[08/14/2024-01:55:03] [I] Iterations: 10
[08/14/2024-01:55:03] [I] Duration: 3s (+ 200ms warm up)
[08/14/2024-01:55:03] [I] Sleep time: 0ms
[08/14/2024-01:55:03] [I] Idle time: 0ms
[08/14/2024-01:55:03] [I] Inference Streams: 1
[08/14/2024-01:55:03] [I] ExposeDMA: Disabled
[08/14/2024-01:55:03] [I] Data transfers: Enabled
[08/14/2024-01:55:03] [I] Spin-wait: Disabled
[08/14/2024-01:55:03] [I] Multithreading: Disabled
[08/14/2024-01:55:03] [I] CUDA Graph: Disabled
[08/14/2024-01:55:03] [I] Separate profiling: Disabled
[08/14/2024-01:55:03] [I] Time Deserialize: Disabled
[08/14/2024-01:55:03] [I] Time Refit: Disabled
[08/14/2024-01:55:03] [I] NVTX verbosity: 0
[08/14/2024-01:55:03] [I] Persistent Cache Ratio: 0
[08/14/2024-01:55:03] [I] Inputs:
[08/14/2024-01:55:03] [I] === Reporting Options ===
[08/14/2024-01:55:03] [I] Verbose: Enabled
[08/14/2024-01:55:03] [I] Averages: 10 inferences
[08/14/2024-01:55:03] [I] Percentiles: 90,95,99
[08/14/2024-01:55:03] [I] Dump refittable layers:Disabled
[08/14/2024-01:55:03] [I] Dump output: Disabled
[08/14/2024-01:55:03] [I] Profile: Disabled
[08/14/2024-01:55:03] [I] Export timing to JSON file:
[08/14/2024-01:55:03] [I] Export output to JSON file:
[08/14/2024-01:55:03] [I] Export profile to JSON file:
[08/14/2024-01:55:03] [I]
[08/14/2024-01:55:03] [I] === Device Information ===
[08/14/2024-01:55:03] [I] Selected Device: Orin
[08/14/2024-01:55:03] [I] Compute Capability: 8.7
[08/14/2024-01:55:03] [I] SMs: 8
[08/14/2024-01:55:03] [I] Device Global Memory: 15656 MiB
[08/14/2024-01:55:03] [I] Shared Memory per SM: 164 KiB
[08/14/2024-01:55:03] [I] Memory Bus Width: 256 bits (ECC disabled)
[08/14/2024-01:55:03] [I] Application Compute Clock Rate: 0.918 GHz
[08/14/2024-01:55:03] [I] Application Memory Clock Rate: 0.918 GHz
[08/14/2024-01:55:03] [I]
[08/14/2024-01:55:03] [I] Note: The application clock rates do not reflect the actual clock rates that the GPU is currently running at.
[08/14/2024-01:55:03] [I]
[08/14/2024-01:55:03] [I] TensorRT version: 8.6.2
[08/14/2024-01:55:03] [I] Loading standard plugins
[08/14/2024-01:55:03] [V] [TRT] Registered plugin creator - ::BatchedNMSDynamic_TRT version 1
[08/14/2024-01:55:03] [V] [TRT] Registered plugin creator - ::BatchedNMS_TRT version 1
[08/14/2024-01:55:03] [V] [TRT] Registered plugin creator - ::BatchTilePlugin_TRT version 1
[08/14/2024-01:55:03] [V] [TRT] Registered plugin creator - ::Clip_TRT version 1
[08/14/2024-01:55:03] [V] [TRT] Registered plugin creator - ::CoordConvAC version 1
[08/14/2024-01:55:03] [V] [TRT] Registered plugin creator - ::CropAndResizeDynamic version 1
[08/14/2024-01:55:03] [V] [TRT] Registered plugin creator - ::CropAndResize version 1
[08/14/2024-01:55:03] [V] [TRT] Registered plugin creator - ::DecodeBbox3DPlugin version 1
[08/14/2024-01:55:03] [V] [TRT] Registered plugin creator - ::DetectionLayer_TRT version 1
[08/14/2024-01:55:03] [V] [TRT] Registered plugin creator - ::EfficientNMS_Explicit_TF_TRT version 1
[08/14/2024-01:55:03] [V] [TRT] Registered plugin creator - ::EfficientNMS_Implicit_TF_TRT version 1
[08/14/2024-01:55:03] [V] [TRT] Registered plugin creator - ::EfficientNMS_ONNX_TRT version 1
[08/14/2024-01:55:03] [V] [TRT] Registered plugin creator - ::EfficientNMS_TRT version 1
[08/14/2024-01:55:03] [V] [TRT] Registered plugin creator - ::FlattenConcat_TRT version 1
[08/14/2024-01:55:03] [V] [TRT] Registered plugin creator - ::GenerateDetection_TRT version 1
[08/14/2024-01:55:03] [V] [TRT] Registered plugin creator - ::GridAnchor_TRT version 1
[08/14/2024-01:55:03] [V] [TRT] Registered plugin creator - ::GridAnchorRect_TRT version 1
[08/14/2024-01:55:03] [V] [TRT] Registered plugin creator - ::InstanceNormalization_TRT version 1
[08/14/2024-01:55:03] [V] [TRT] Registered plugin creator - ::InstanceNormalization_TRT version 2
[08/14/2024-01:55:03] [V] [TRT] Registered plugin creator - ::LReLU_TRT version 1
[08/14/2024-01:55:03] [V] [TRT] Registered plugin creator - ::ModulatedDeformConv2d version 1
[08/14/2024-01:55:03] [V] [TRT] Registered plugin creator - ::MultilevelCropAndResize_TRT version 1
[08/14/2024-01:55:03] [V] [TRT] Registered plugin creator - ::MultilevelProposeROI_TRT version 1
[08/14/2024-01:55:03] [V] [TRT] Registered plugin creator - ::MultiscaleDeformableAttnPlugin_TRT version 1
[08/14/2024-01:55:03] [V] [TRT] Registered plugin creator - ::NMSDynamic_TRT version 1
[08/14/2024-01:55:03] [V] [TRT] Registered plugin creator - ::NMS_TRT version 1
[08/14/2024-01:55:03] [V] [TRT] Registered plugin creator - ::Normalize_TRT version 1
[08/14/2024-01:55:03] [V] [TRT] Registered plugin creator - ::PillarScatterPlugin version 1
[08/14/2024-01:55:03] [V] [TRT] Registered plugin creator - ::PriorBox_TRT version 1
[08/14/2024-01:55:03] [V] [TRT] Registered plugin creator - ::ProposalDynamic version 1
[08/14/2024-01:55:03] [V] [TRT] Registered plugin creator - ::ProposalLayer_TRT version 1
[08/14/2024-01:55:03] [V] [TRT] Registered plugin creator - ::Proposal version 1
[08/14/2024-01:55:03] [V] [TRT] Registered plugin creator - ::PyramidROIAlign_TRT version 1
[08/14/2024-01:55:03] [V] [TRT] Registered plugin creator - ::Region_TRT version 1
[08/14/2024-01:55:03] [V] [TRT] Registered plugin creator - ::Reorg_TRT version 1
[08/14/2024-01:55:03] [V] [TRT] Registered plugin creator - ::ResizeNearest_TRT version 1
[08/14/2024-01:55:03] [V] [TRT] Registered plugin creator - ::ROIAlign_TRT version 1
[08/14/2024-01:55:03] [V] [TRT] Registered plugin creator - ::RPROI_TRT version 1
[08/14/2024-01:55:03] [V] [TRT] Registered plugin creator - ::ScatterND version 1
[08/14/2024-01:55:03] [V] [TRT] Registered plugin creator - ::SpecialSlice_TRT version 1
[08/14/2024-01:55:03] [V] [TRT] Registered plugin creator - ::Split version 1
[08/14/2024-01:55:03] [V] [TRT] Registered plugin creator - ::VoxelGeneratorPlugin version 1
[08/14/2024-01:55:03] [I] Engine loaded in 0.0112724 sec.
[08/14/2024-01:55:03] [I] [TRT] Loaded engine size: 17 MiB
[08/14/2024-01:55:03] [V] [TRT] Deserialization required 5573 microseconds.
[08/14/2024-01:55:03] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +17, GPU +0, now: CPU 17, GPU 0 (MiB)
[08/14/2024-01:55:03] [I] Engine deserialized in 0.0148129 sec.
[08/14/2024-01:55:03] [V] [TRT] Total per-runner device persistent memory is 0
[08/14/2024-01:55:03] [V] [TRT] Total per-runner host persistent memory is 160
[08/14/2024-01:55:03] [V] [TRT] Allocated activation device memory of size 0
[08/14/2024-01:55:03] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 17, GPU 0 (MiB)
[08/14/2024-01:55:03] [V] [TRT] CUDA lazy loading is enabled.
[08/14/2024-01:55:03] [I] Setting persistentCacheLimit to 0 bytes.
[08/14/2024-01:55:03] [V] Using enqueueV3.
[08/14/2024-01:55:03] [I] Using random values for input input.1
[08/14/2024-01:55:03] [I] Input binding for input.1 with dimensions 3x3x576x960 is created.
[08/14/2024-01:55:03] [I] Output binding for 252 with dimensions 3x3x158x254 is created.
[08/14/2024-01:55:03] [I] Output binding for 255 with dimensions 3x2x158x254 is created.
[08/14/2024-01:55:03] [I] Output binding for 258 with dimensions 3x18x158x254 is created.
[08/14/2024-01:55:03] [I] Output binding for 261 with dimensions 3x8x158x254 is created.
[08/14/2024-01:55:03] [I] Output binding for 264 with dimensions 3x3x158x254 is created.
[08/14/2024-01:55:03] [I] Output binding for 267 with dimensions 3x1x158x254 is created.
[08/14/2024-01:55:03] [I] Output binding for 270 with dimensions 3x2x158x254 is created.
[08/14/2024-01:55:03] [I] Output binding for 273 with dimensions 3x9x158x254 is created.
[08/14/2024-01:55:03] [I] Output binding for 276 with dimensions 3x2x158x254 is created.
[08/14/2024-01:55:03] [I] Starting inference
[08/14/2024-01:55:06] [I] Warmup completed 3 queries over 200 ms
[08/14/2024-01:55:06] [I] Timing trace has 45 queries over 3.26789 s
[08/14/2024-01:55:06] [I]
[08/14/2024-01:55:06] [I] === Trace details ===
[08/14/2024-01:55:06] [I] Trace averages of 10 runs:
[08/14/2024-01:55:06] [I] Average on 10 runs - GPU latency: 70.9911 ms - Host latency: 75.884 ms (enqueue 0.355682 ms)
[08/14/2024-01:55:06] [I] Average on 10 runs - GPU latency: 71.0764 ms - Host latency: 75.9755 ms (enqueue 0.352441 ms)
[08/14/2024-01:55:06] [I] Average on 10 runs - GPU latency: 71.0421 ms - Host latency: 75.9447 ms (enqueue 0.352783 ms)
[08/14/2024-01:55:06] [I] Average on 10 runs - GPU latency: 71.0311 ms - Host latency: 75.9425 ms (enqueue 0.372925 ms)
[08/14/2024-01:55:06] [I]
[08/14/2024-01:55:06] [I] === Performance summary ===
[08/14/2024-01:55:06] [I] Throughput: 13.7704 qps
[08/14/2024-01:55:06] [I] Latency: min = 75.7847 ms, max = 76.7615 ms, mean = 75.9393 ms, median = 75.9025 ms, percentile(90%) = 76.0905 ms, percentile(95%) = 76.117 ms, percentile(99%) = 76.7615 ms
[08/14/2024-01:55:06] [I] Enqueue Time: min = 0.307373 ms, max = 0.447754 ms, mean = 0.360925 ms, median = 0.36377 ms, percentile(90%) = 0.4021 ms, percentile(95%) = 0.406982 ms, percentile(99%) = 0.447754 ms
[08/14/2024-01:55:06] [I] H2D Latency: min = 0.790466 ms, max = 0.912109 ms, mean = 0.807107 ms, median = 0.799561 ms, percentile(90%) = 0.831299 ms, percentile(95%) = 0.837158 ms, percentile(99%) = 0.912109 ms
[08/14/2024-01:55:06] [I] GPU Compute Time: min = 70.9006 ms, max = 71.8467 ms, mean = 71.0389 ms, median = 70.9785 ms, percentile(90%) = 71.1998 ms, percentile(95%) = 71.2307 ms, percentile(99%) = 71.8467 ms
[08/14/2024-01:55:06] [I] D2H Latency: min = 4.08081 ms, max = 4.11395 ms, mean = 4.09323 ms, median = 4.08911 ms, percentile(90%) = 4.10986 ms, percentile(95%) = 4.11121 ms, percentile(99%) = 4.11395 ms
[08/14/2024-01:55:06] [I] Total Host Walltime: 3.26789 s
[08/14/2024-01:55:06] [I] Total GPU Compute Time: 3.19675 s
[08/14/2024-01:55:06] [I] Explanations of the performance metrics are printed in the verbose logs.
[08/14/2024-01:55:06] [V]
[08/14/2024-01:55:06] [V] === Explanations of the performance metrics ===
[08/14/2024-01:55:06] [V] Total Host Walltime: the host walltime from when the first query (after warmups) is enqueued to when the last query is completed.
[08/14/2024-01:55:06] [V] GPU Compute Time: the GPU latency to execute the kernels for a query.
[08/14/2024-01:55:06] [V] Total GPU Compute Time: the summation of the GPU Compute Time of all the queries. If this is significantly shorter than Total Host Walltime, the GPU may be under-utilized because of host-side overheads or data transfers.
[08/14/2024-01:55:06] [V] Throughput: the observed throughput computed by dividing the number of queries by the Total Host Walltime. If this is significantly lower than the reciprocal of GPU Compute Time, the GPU may be under-utilized because of host-side overheads or data transfers.
[08/14/2024-01:55:06] [V] Enqueue Time: the host latency to enqueue a query. If this is longer than GPU Compute Time, the GPU may be under-utilized.
[08/14/2024-01:55:06] [V] H2D Latency: the latency for host-to-device data transfers for input tensors of a single query.
[08/14/2024-01:55:06] [V] D2H Latency: the latency for device-to-host data transfers for output tensors of a single query.
[08/14/2024-01:55:06] [V] Latency: the summation of H2D Latency, GPU Compute Time, and D2H Latency. This is the latency to infer a single query.
[08/14/2024-01:55:06] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8602] # /usr/src/tensorrt/bin/trtexec --loadEngine=model_best_int8.trt8 --verbose --int8
NVMAP_IOC_PARAMETERS failed: No such device
(the line above is repeated 10 times)