DLA Inference Latency Issue on Orin Platform

Hello,

I am testing DLA model inference on the Orin platform and observed the following issues. I would greatly appreciate any insights or suggestions.

Environment:

  • Platform: Orin

  • SDK: DriveOS 6.0.8.1

  • Framework: TensorRT

  • Model type: CNN (all layers mapped to DLA, no GPU fallback)

Issues:

  1. Inference Latency Increases with Concurrent GPU Workloads
    When I start another program that runs a model on the GPU, the inference latency of the DLA model increases. Nsight analysis shows that the intervals between DLA tasks become longer.

    • Why does this task interval increase?

    • I also tried using cudaHostAlloc with cudaHostAllocDefault buffers, but the behavior remains the same.

  2. Model Compiled into Multiple Subgraphs
    After compiling the CNN model for DLA, TensorRT partitions it into three subgraphs.

    • Why does TensorRT split the model into multiple subgraphs when targeting DLA?

    • Is there a way to avoid such partitioning and keep the model in a single DLA graph?

det_single.zip (4.6 MB)

Thank you in advance for your help!

*** Please note that this reply is generated by an LLM automatically ***

I’d be happy to help you with your questions about DLA model inference on the Orin platform.

Issue 1: Increased inference latency of DLA model when running another program on the GPU

The increased inference latency of the DLA model when another program runs on the GPU is most likely caused by contention for shared resources. Although the DLA is a separate hardware accelerator, it still shares resources with the GPU, in particular DRAM bandwidth and the system interconnect.

When another program is running on the GPU, it consumes part of this shared bandwidth, increasing contention and therefore latency for the DLA model. The effect is most pronounced when the GPU workload is memory-intensive or saturates the system interconnect, which would also explain the longer intervals between DLA tasks you see in Nsight.

To mitigate this issue, you can try to reduce the memory bandwidth and interconnect utilization of the GPU program. You can also use the --useDLACore flag with trtexec to pin the workload to a specific DLA core; note, however, that this mainly avoids contention between DLA workloads — both DLA cores still share DRAM bandwidth with the GPU, so it may not fully eliminate the interference.
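As a sketch of how you could measure this, the following trtexec invocations build a DLA-only engine and then benchmark it, first with the system idle and then with the concurrent GPU program running, so the two latency distributions can be compared. The file name model.onnx is a placeholder for your network; all flags shown are standard trtexec options.

```shell
# Build a DLA engine (model.onnx is a placeholder name).
# DLA requires FP16 or INT8; without --allowGPUFallback the build fails
# instead of silently moving unsupported layers to the GPU.
trtexec --onnx=model.onnx --useDLACore=0 --fp16 \
        --saveEngine=model_dla.engine

# Benchmark the saved engine. Run this once with the system idle and
# once while the concurrent GPU program is running, then compare the
# reported latency statistics.
trtexec --loadEngine=model_dla.engine --useDLACore=0 --fp16 \
        --iterations=1000
```

Comparing the mean and percentile latencies between the two runs quantifies how much of the slowdown is attributable to the concurrent GPU workload.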

Issue 2: TensorRT partitioning the CNN model into multiple subgraphs when targeting DLA

TensorRT partitions the CNN model into multiple subgraphs when targeting DLA because of the limitations of the DLA hardware. The DLA is a fixed-function engine that is optimized for specific types of neural network layers, such as convolution, deconvolution, activation, pooling, local response normalization, and fully connected layers.

When a CNN model is compiled for the DLA, TensorRT analyzes the model and partitions it into subgraphs that can be executed on the DLA. This partitioning is done to ensure that each subgraph can be executed efficiently on the DLA, taking into account the DLA’s limitations and optimizations.

There are several reasons why TensorRT may partition a CNN model into multiple subgraphs when targeting DLA:

  1. Layer limitations: The DLA supports only a subset of layer types and parameter configurations. Even a layer type that is supported in general can force a split if a specific configuration (for example, kernel size, stride, or tensor dimensions) exceeds a DLA limit. Since in your case all layers run on the DLA with no GPU fallback, the three subgraphs are more likely separate DLA loadables caused by such configuration or resource limits than by unsupported layers falling back to the GPU.
  2. Memory constraints: Each DLA loadable must fit within the DLA's limited memory resources (notably its on-chip SRAM). If the intermediate tensors or weights of a region of the network exceed this budget, TensorRT splits the network into multiple loadables so that each one fits.
  3. Optimization: TensorRT may partition the model into subgraphs to optimize the execution of the model on the DLA. For example, TensorRT may group together layers that can be executed in parallel on the DLA, or it may split apart layers that have different execution requirements.

To reduce partitioning, you can try to make the model more DLA-friendly: simplify the architecture, shrink intermediate tensor sizes, or replace problematic layers with DLA-supported alternatives. Building with verbose logging shows exactly where each split occurs, which tells you which layer or constraint triggered it.
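A hedged sketch of how to investigate the splits with trtexec (model.onnx is again a placeholder; the --memPoolSize pool names assume a TensorRT 8.4+ build, which DriveOS 6.0.8.1 provides):

```shell
# Rebuild with verbose logging and filter for DLA-related messages to
# see how the network is partitioned into loadables and why.
trtexec --onnx=model.onnx --useDLACore=0 --fp16 --verbose 2>&1 \
  | grep -i -E "dla|loadable"

# If the splits turn out to be memory-related, enlarging the DLA memory
# pools at build time may reduce the number of loadables.
trtexec --onnx=model.onnx --useDLACore=0 --fp16 \
        --memPoolSize=dlaSRAM:1M,dlaLocalDRAM:1024M
```

The verbose log is the authoritative source here: it names the boundary layers of each loadable, so you can see whether a split is caused by a layer configuration or by a memory limit before changing the model.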

Comparison with other platforms

The behavior you are observing on the Orin platform is similar to what you would see on other DLA-equipped platforms, such as Xavier. However, the specific details of the partitioning and execution may vary depending on the platform and the CNN model being used.

In general, the DLA is a powerful hardware accelerator that can provide significant performance benefits for CNN models. However, it does require careful optimization and tuning to achieve the best results.

I hope this helps! Let me know if you have any further questions or if there’s anything else I can help with.

*** Please give the thumbs up if you get answers you like. Or provide feedback to help us improve the answer. ***

This forum is for the Jetson AGX Orin platform. Since your issue appears to be related to the DRIVE Orin platform, please open a new topic at Latest DRIVE AGX Orin/DRIVE AGX Orin General topics - NVIDIA Developer Forums to get support.

The DRIVE forum is exclusively for developers who are part of the NVIDIA DRIVE™ AGX SDK Developer Program. To post in the forum, please use an account associated with your corporate or university email address.
This helps us ensure that the forum remains a platform for verified members of the developer program.
Thanks.