DLA Inference Latency Issue on Orin Platform

Please provide the following info (tick the boxes after creating this topic):
Software Version
DRIVE OS 6.0.8.1

Target Operating System
Linux

Hardware Platform
DRIVE AGX Orin Developer Kit (not sure of the exact part number)

SDK Manager Version
other

Host Machine Version
native Ubuntu Linux 20.04 Host installed with DRIVE OS Docker Containers

Issue Description

I am testing DLA model inference on the Orin platform and observed the following issues. I would greatly appreciate any insights or suggestions.

Environment:

  • Model type: CNN (all layers mapped to DLA, no GPU fallback)

Issues:

  1. Inference Latency Increases with Concurrent GPU Workloads
    When I start another program that runs a model on the GPU, the inference latency of the DLA model increases. Nsight analysis shows that the intervals between DLA tasks become longer.

    • Why does this task interval increase?

    • I also tried allocating the I/O buffers with cudaHostAlloc using cudaHostAllocDefault, but the behavior remains the same (a sketch of this setup follows the list below).

  2. Model Compiled into Multiple Subgraphs
    After compiling the CNN model for DLA, TensorRT partitions it into three subgraphs.

    • Why does TensorRT split the model into multiple subgraphs when targeting DLA?

    • Is there a way to avoid such partitioning and keep the model in a single DLA graph?
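
For reference, here is a minimal sketch along the lines of what I tried, assuming a prebuilt TensorRT DLA engine with one input and one output binding; the tensor sizes are placeholders and the TensorRT enqueue call is only indicated as a comment:

```cpp
// Minimal sketch: pinned host staging buffers for DLA inference via TensorRT.
// Assumes one input and one output binding; the sizes below are placeholders.
#include <cuda_runtime.h>

int main() {
    const size_t inBytes  = 1 * 3 * 224 * 224 * sizeof(float);  // placeholder input size
    const size_t outBytes = 1 * 1000 * sizeof(float);           // placeholder output size

    // Page-locked host buffers, allocated with cudaHostAllocDefault as in the post.
    float *hIn = nullptr, *hOut = nullptr;
    cudaHostAlloc(reinterpret_cast<void**>(&hIn),  inBytes,  cudaHostAllocDefault);
    cudaHostAlloc(reinterpret_cast<void**>(&hOut), outBytes, cudaHostAllocDefault);

    // Device buffers bound to the engine's I/O tensors.
    void *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, inBytes);
    cudaMalloc(&dOut, outBytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Per-inference pattern: async H2D copy, enqueue on DLA, async D2H copy, sync.
    cudaMemcpyAsync(dIn, hIn, inBytes, cudaMemcpyHostToDevice, stream);
    // context->enqueueV2(bindings, stream, nullptr);  // TensorRT execution context on DLA
    cudaMemcpyAsync(hOut, dOut, outBytes, cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);

    cudaFree(dIn); cudaFree(dOut);
    cudaFreeHost(hIn); cudaFreeHost(hOut);
    cudaStreamDestroy(stream);
    return 0;
}
```

Even with this setup, the copies and the enqueue go through a CUDA stream, so the DLA submission still shares a CUDA context with other GPU work, which may be relevant to the latency observation.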

det_single.zip (4.6 MB)
Error String

Logs

Dear @qiuwen ,
Did you test your model with trtexec? Is it possible to share the model here or via private message?
The model could be divided into subgraphs due to memory limitations. You can enable the verbose flag with trtexec to get more information.
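
For example, something along these lines (the ONNX file name is a placeholder; please check trtexec --help on your release for the exact flags):

```
trtexec --onnx=det_single.onnx \
        --useDLACore=0 \
        --int8 \
        --verbose \
        > trtexec_verbose.log 2>&1
```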

Dear SivaRamaKrishnaNV,

  1. Inference Latency Increases with Concurrent GPU Workloads
    While profiling with Nsight during DLA inference, I observed that once GPU inference is launched, the GPU context switches frequently. Could this context switching be a potential cause of the increased latency observed when running GPU and DLA models concurrently?

    Additionally, on the DLA side, the idle gaps observed between inference tasks seem to suggest scheduling delays. Could you clarify whether these gaps are indeed due to task scheduling, or if there might be other underlying reasons?

    If we perform inference on the DLA using the cuDLA API with a DLA loadable instead of a TensorRT engine, would this eliminate the GPU context entirely and potentially avoid the frequent GPU context switching observed when running GPU and DLA inference concurrently? (A standalone-mode sketch follows this list.)

    bev_seg_det_dla_0829.zip (22.5 MB)

  2. Model Compiled into Multiple Subgraphs
    I am sharing the log obtained after performing model quantization with trtexec, which shows that the model was partitioned into three subgraphs. Could you help explain the reason behind this partitioning, and advise on how the configuration could be adjusted so that the quantized model generates a single subgraph?

    trtexec_0829.log (619.2 KB)
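
For reference, this is the kind of standalone-mode setup I have in mind (a sketch only; the loadable bytes are a placeholder, and the NvSciBuf/NvSciSync tensor registration and synchronization that standalone mode requires are only indicated as comments):

```cpp
// Sketch: opening a DLA in cuDLA standalone mode, which does not create a
// CUDA/GPU context. The loadable bytes are a placeholder; tensor registration
// and synchronization use NvSciBuf/NvSciSync in this mode (omitted below).
#include <cudla.h>
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    cudlaDevHandle dev = nullptr;

    // CUDLA_STANDALONE: no CUDA context is involved, in contrast to the
    // default hybrid mode (CUDLA_CUDA_DLA), which submits via CUDA streams.
    cudlaStatus st = cudlaCreateDevice(0 /* DLA instance */, &dev, CUDLA_STANDALONE);
    if (st != cudlaSuccess) { std::printf("cudlaCreateDevice failed: %d\n", st); return 1; }

    // Placeholder: read a DLA loadable built offline from disk.
    std::vector<uint8_t> loadable;

    cudlaModule module = nullptr;
    st = cudlaModuleLoadFromMemory(dev, loadable.data(), loadable.size(), &module, 0);
    if (st != cudlaSuccess) { std::printf("cudlaModuleLoadFromMemory failed: %d\n", st); }

    // ... register NvSciBuf-backed tensors with cudlaMemRegister and submit
    //     inference with cudlaSubmitTask, fencing via NvSciSync ...

    if (module) cudlaModuleUnload(module, 0);
    cudlaDestroyDevice(dev);
    return 0;
}
```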

Dear SivaRamaKrishnaNV,

Hi, I have uploaded the full log in my previous post. Could you please help to analyze the log and point out the possible issues?
Thanks a lot for your support.

Dear @qiuwen ,
Currently, cuDLA has a small dependency on the GPU, so some delay in the DLA execution pipeline is expected when another task is launched on the GPU.
Regarding the model being partitioned into subgraphs, could you try increasing the DLA memory pool parameters and check whether the number of subgraphs decreases? A sketch follows below.
It would be great if you could share a dummy model that reproduces the issue so we can get more insights from the core team.
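
As a sketch, the pool limits can be raised through the TensorRT 8.x C++ builder API; the values below are illustrative, not recommendations:

```cpp
// Sketch: raising the DLA memory pool limits at build time so the DLA
// compiler has more room before it splits the network into multiple
// loadables (subgraphs). Pool sizes below are illustrative, not tuned.
#include "NvInfer.h"
#include <cstdio>

class Logger : public nvinfer1::ILogger {
    void log(Severity severity, const char* msg) noexcept override {
        if (severity <= Severity::kWARNING) std::printf("%s\n", msg);
    }
} gLogger;

int main() {
    auto* builder = nvinfer1::createInferBuilder(gLogger);
    auto* config  = builder->createBuilderConfig();

    config->setDefaultDeviceType(nvinfer1::DeviceType::kDLA);
    config->setDLACore(0);

    config->setMemoryPoolLimit(nvinfer1::MemoryPoolType::kDLA_MANAGED_SRAM, 1u << 20);     // 1 MiB
    config->setMemoryPoolLimit(nvinfer1::MemoryPoolType::kDLA_LOCAL_DRAM,   1ull << 30);   // 1 GiB
    config->setMemoryPoolLimit(nvinfer1::MemoryPoolType::kDLA_GLOBAL_DRAM,  512ull << 20); // 512 MiB

    // ... parse the ONNX network and build the engine as usual ...

    delete config;
    delete builder;
    return 0;
}
```

With trtexec, the corresponding option is --memPoolSize with the dlaSRAM, dlaLocalDRAM, and dlaGlobalDRAM pools; please check trtexec --help on your release for the exact size syntax.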


Dear SivaRamaKrishnaNV,

I’ve attached the simplified model, calibration table, and the trtexec quantization log.
The model has three down-sampling scales:

  • 32x: 5 heads

  • 8x: 5 heads

  • 16x: 19 heads

After quantization it still gets split into three subgraphs. Could you take a look and share some insights on why it is being split this way, and how we might adjust the configuration to generate a single subgraph instead?

det_minimal_cache.zip (1.1 KB)

trtexec.log (264.3 KB)

det_minimal_onnx.zip (641.1 KB)

Dear @qiuwen ,
Thanks for sharing the model. We will repro the issue and get back to you.

Dear @qiuwen ,
I could reproduce the issue, and it looks like a bug in DRIVE OS 6.0.10.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.