Please provide the following info (tick the boxes after creating this topic):
Software Version
DRIVE OS 6.0.8.1
Target Operating System
Linux
Hardware Platform
DRIVE AGX Orin Developer Kit (not sure of the exact model number)
SDK Manager Version
other
Host Machine Version
native Ubuntu Linux 20.04 Host installed with DRIVE OS Docker Containers
Issue Description
I am testing DLA model inference on the Orin platform and have observed the following issues. I would greatly appreciate any insights or suggestions.
Environment:
- Model type: CNN (all layers mapped to DLA, no GPU fallback)
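For context, the engine is built with all layers on DLA and GPU fallback disabled, roughly along these lines (simplified sketch; network creation, parsing, and error handling are omitted, and names are illustrative):

```cpp
// Simplified sketch of the DLA build configuration (not the exact application code).
#include "NvInfer.h"

void configureForDla(nvinfer1::IBuilderConfig* config)
{
    // Place all layers on DLA core 0 and build in FP16.
    config->setDefaultDeviceType(nvinfer1::DeviceType::kDLA);
    config->setDLACore(0);
    config->setFlag(nvinfer1::BuilderFlag::kFP16);

    // GPU fallback is intentionally NOT enabled, so every layer must run on DLA:
    // config->setFlag(nvinfer1::BuilderFlag::kGPU_FALLBACK);
}
```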
Issues:

**1. Inference Latency Increases with Concurrent GPU Workloads**

When I start another program that runs a model on the GPU, the inference latency of the DLA model increases. Nsight analysis shows that the intervals between DLA tasks become longer.

- Why does this task interval increase?
- I also tried using `cudaHostAlloc` with `cudaHostAllocDefault` buffers (see the sketch below), but the behavior remains the same.
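The pinned host buffers mentioned above are allocated roughly as follows (illustrative sketch; buffer names and sizes are placeholders, not the exact code):

```cpp
// Illustrative allocation of pinned host buffers via cudaHostAlloc / cudaHostAllocDefault.
#include <cuda_runtime.h>

int main()
{
    void*  inputHost   = nullptr;
    void*  outputHost  = nullptr;
    size_t inputBytes  = 3 * 512 * 512 * sizeof(float);  // example input size
    size_t outputBytes = 1000 * sizeof(float);           // example output size

    cudaHostAlloc(&inputHost,  inputBytes,  cudaHostAllocDefault);
    cudaHostAlloc(&outputHost, outputBytes, cudaHostAllocDefault);

    // ... H2D copy, enqueue the DLA engine on a CUDA stream, D2H copy ...

    cudaFreeHost(inputHost);
    cudaFreeHost(outputHost);
    return 0;
}
```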
**2. Model Compiled into Multiple Subgraphs**

After compiling the CNN model for DLA, TensorRT partitions it into three subgraphs.

- Why does TensorRT split the model into multiple subgraphs when targeting DLA?
- Is there a way to avoid such partitioning and keep the model in a single DLA graph?
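For reference, the partitioning is visible when dumping per-layer information from the built engine, for example with the engine inspector (illustrative sketch; assumes a deserialized `engine` and a TensorRT version that provides `IEngineInspector`):

```cpp
// Illustrative sketch: print per-layer information to see how TensorRT
// partitioned the network across DLA subgraphs (assumes `engine` is valid).
#include <iostream>
#include "NvInfer.h"

void printLayerInfo(nvinfer1::ICudaEngine* engine)
{
    nvinfer1::IEngineInspector* inspector = engine->createEngineInspector();
    std::cout << inspector->getEngineInformation(
                     nvinfer1::LayerInformationFormat::kONELINE)
              << std::endl;
    delete inspector;
}
```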
det_single.zip (4.6 MB)
Error String
Logs


