DLA Inference Latency Issue on Orin Platform

Please provide the following info (tick the boxes after creating this topic):
Software Version
DRIVE OS 6.0.8.1

Target Operating System
Linux

Hardware Platform
DRIVE AGX Orin Developer Kit (not sure of the exact part number)

SDK Manager Version
other

Host Machine Version
native Ubuntu Linux 20.04 Host installed with DRIVE OS Docker Containers

Issue Description

I am testing DLA model inference on the Orin platform and observed the following issues. I would greatly appreciate any insights or suggestions.

Environment:

  • Model type: CNN (all layers mapped to DLA, no GPU fallback)

Issues:

  1. Inference Latency Increases with Concurrent GPU Workloads
    When I start another program that runs a model on the GPU, the inference latency of the DLA model increases. Nsight analysis shows that the intervals between DLA tasks become longer.

    • Why does this task interval increase?

    • I also tried allocating the I/O buffers with cudaHostAlloc using cudaHostAllocDefault, but the behavior remains the same (a sketch of this setup follows the list below).

  2. Model Compiled into Multiple Subgraphs
    After compiling the CNN model for DLA, TensorRT partitions it into three subgraphs.

    • Why does TensorRT split the model into multiple subgraphs when targeting DLA?

    • Is there a way to avoid such partitioning and keep the model in a single DLA graph?
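
For reference, here is a minimal sketch along the lines of what I tried, assuming a prebuilt TensorRT DLA engine with one input and one output binding; the tensor sizes are placeholders and the TensorRT enqueue call is only indicated as a comment:

```cpp
// Minimal sketch: pinned host staging buffers for DLA inference via TensorRT.
// Assumes one input and one output binding; the sizes below are placeholders.
#include <cuda_runtime.h>

int main() {
    const size_t inBytes  = 1 * 3 * 224 * 224 * sizeof(float);  // placeholder input size
    const size_t outBytes = 1 * 1000 * sizeof(float);           // placeholder output size

    // Page-locked host buffers, allocated with cudaHostAllocDefault as in the post.
    float *hIn = nullptr, *hOut = nullptr;
    cudaHostAlloc(reinterpret_cast<void**>(&hIn),  inBytes,  cudaHostAllocDefault);
    cudaHostAlloc(reinterpret_cast<void**>(&hOut), outBytes, cudaHostAllocDefault);

    // Device buffers bound to the engine's I/O tensors.
    void *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, inBytes);
    cudaMalloc(&dOut, outBytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Per-inference pattern: async H2D copy, enqueue on DLA, async D2H copy, sync.
    cudaMemcpyAsync(dIn, hIn, inBytes, cudaMemcpyHostToDevice, stream);
    // context->enqueueV2(bindings, stream, nullptr);  // TensorRT execution context on DLA
    cudaMemcpyAsync(hOut, dOut, outBytes, cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);

    cudaFree(dIn); cudaFree(dOut);
    cudaFreeHost(hIn); cudaFreeHost(hOut);
    cudaStreamDestroy(stream);
    return 0;
}
```

Even with this setup, the copies and the enqueue go through a CUDA stream, so the DLA submission still shares a CUDA context with other GPU work, which may be relevant to the latency observation.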

det_single.zip (4.6 MB)
Error String

Logs

Dear @qiuwen ,
Did you test your model with trtexec? Is it possible to share the model here or via private message?
The model could be divided into subgraphs due to memory limitations. You can enable the verbose flag with trtexec to get more information.
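
For example, something along these lines (the ONNX file name is a placeholder; please check trtexec --help on your release for the exact flags):

```
trtexec --onnx=det_single.onnx \
        --useDLACore=0 \
        --int8 \
        --verbose \
        > trtexec_verbose.log 2>&1
```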

Dear SivaRamaKrishnaNV,

  1. Inference Latency Increases with Concurrent GPU Workloads
    While profiling with Nsight during DLA inference, I observed that once GPU inference is launched, the GPU context switches frequently. Could this context switching be a potential cause of the increased latency observed when running GPU and DLA models concurrently?

    Additionally, on the DLA side, the idle gaps observed between inference tasks seem to suggest scheduling delays. Could you clarify whether these gaps are indeed due to task scheduling, or if there might be other underlying reasons?

    If we perform inference on the DLA using the cuDLA API with a DLA loadable instead of a TensorRT engine, would this eliminate the GPU context entirely and potentially avoid the frequent GPU context switching observed when running GPU and DLA inference concurrently? (A standalone-mode sketch follows this list.)

    bev_seg_det_dla_0829.zip (22.5 MB)

  2. Model Compiled into Multiple Subgraphs
    I am sharing the log obtained after performing model quantization with trtexec, which shows that the model was partitioned into three subgraphs. Could you help explain the reason behind this partitioning, and advise on how the configuration could be adjusted so that the quantized model generates a single subgraph?

    trtexec_0829.log (619.2 KB)
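
For reference, this is the kind of standalone-mode setup I have in mind (a sketch only; the loadable bytes are a placeholder, and the NvSciBuf/NvSciSync tensor registration and synchronization that standalone mode requires are only indicated as comments):

```cpp
// Sketch: opening a DLA in cuDLA standalone mode, which does not create a
// CUDA/GPU context. The loadable bytes are a placeholder; tensor registration
// and synchronization use NvSciBuf/NvSciSync in this mode (omitted below).
#include <cudla.h>
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    cudlaDevHandle dev = nullptr;

    // CUDLA_STANDALONE: no CUDA context is involved, in contrast to the
    // default hybrid mode (CUDLA_CUDA_DLA), which submits via CUDA streams.
    cudlaStatus st = cudlaCreateDevice(0 /* DLA instance */, &dev, CUDLA_STANDALONE);
    if (st != cudlaSuccess) { std::printf("cudlaCreateDevice failed: %d\n", st); return 1; }

    // Placeholder: read a DLA loadable built offline from disk.
    std::vector<uint8_t> loadable;

    cudlaModule module = nullptr;
    st = cudlaModuleLoadFromMemory(dev, loadable.data(), loadable.size(), &module, 0);
    if (st != cudlaSuccess) { std::printf("cudlaModuleLoadFromMemory failed: %d\n", st); }

    // ... register NvSciBuf-backed tensors with cudlaMemRegister and submit
    //     inference with cudlaSubmitTask, fencing via NvSciSync ...

    if (module) cudlaModuleUnload(module, 0);
    cudlaDestroyDevice(dev);
    return 0;
}
```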

Dear SivaRamaKrishnaNV,

Hi, I have uploaded the full log in my previous post. Could you please help to analyze the log and point out the possible issues?
Thanks a lot for your support.

Dear @qiuwen ,
Currently, cuDLA has a small dependency on the GPU, so some delay in the DLA execution pipeline is expected when another task is launched on the GPU.
Regarding the model being partitioned into subgraphs, could you try increasing the DLA memory pool parameters and check whether the number of subgraphs decreases? A sketch follows below.
It would be great if you could share a dummy model that reproduces the issue so we can get more insights from the core team.
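
As a sketch, the pool limits can be raised through the TensorRT 8.x C++ builder API; the values below are illustrative, not recommendations:

```cpp
// Sketch: raising the DLA memory pool limits at build time so the DLA
// compiler has more room before it splits the network into multiple
// loadables (subgraphs). Pool sizes below are illustrative, not tuned.
#include "NvInfer.h"
#include <cstdio>

class Logger : public nvinfer1::ILogger {
    void log(Severity severity, const char* msg) noexcept override {
        if (severity <= Severity::kWARNING) std::printf("%s\n", msg);
    }
} gLogger;

int main() {
    auto* builder = nvinfer1::createInferBuilder(gLogger);
    auto* config  = builder->createBuilderConfig();

    config->setDefaultDeviceType(nvinfer1::DeviceType::kDLA);
    config->setDLACore(0);

    config->setMemoryPoolLimit(nvinfer1::MemoryPoolType::kDLA_MANAGED_SRAM, 1u << 20);     // 1 MiB
    config->setMemoryPoolLimit(nvinfer1::MemoryPoolType::kDLA_LOCAL_DRAM,   1ull << 30);   // 1 GiB
    config->setMemoryPoolLimit(nvinfer1::MemoryPoolType::kDLA_GLOBAL_DRAM,  512ull << 20); // 512 MiB

    // ... parse the ONNX network and build the engine as usual ...

    delete config;
    delete builder;
    return 0;
}
```

With trtexec, the corresponding option is --memPoolSize with the dlaSRAM, dlaLocalDRAM, and dlaGlobalDRAM pools; please check trtexec --help on your release for the exact size syntax.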


Dear SivaRamaKrishnaNV,

I’ve attached the simplified model, calibration table, and the trtexec quantization log.
The model has three down-sampling scales:

  • 32x: 5 heads

  • 8x: 5 heads

  • 16x: 19 heads

After quantization it still gets split into three subgraphs. Could you take a look and share some insights on why it is being split this way, and how we might adjust the configuration to generate a single subgraph instead?

det_minimal_cache.zip (1.1 KB)

trtexec.log (264.3 KB)

det_minimal_onnx.zip (641.1 KB)

Dear @qiuwen ,
Thanks for sharing the model. We will repro the issue and get back to you.

Dear @qiuwen ,
I could reproduce the issue, and it looks like a bug in DRIVE OS 6.0.10.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.