Any performance benefits in using cuDLA directly instead of TensorRT?


I am currently working with TensorRT and DLA on a Jetson Orin Dev Kit, and I have some questions about which workflow to use. From what I understand, there are two possible workflows for running inference on DLA:

  1. Compilation with TRT Builder, runtime with TRT Runtime
  2. Compilation with TRT Builder, runtime with cuDLA API

Here are my questions :

  • Is there another way to create DLA loadables without the TensorRT builder?
  • cuDLA API exposes mechanisms to manage devices, memory and submit DLA tasks. In terms of performance benefits, is there a big difference between the strategy offered by TensorRT and the one we could create with cuDLA?
  • Same question for hybrid and standalone DLA inference.

Thank you very much for your help,


TensorRT Version: 8.4.1
GPU Type: Embedded Jetson Orin
CUDA Version: 11.4


Please check the below links, as they might answer your concerns.

For further assistance, we are moving this post to the Jetson Orin forum to get better help.


The only way to create DLA loadables is using TRT Builder.
No, at the moment there is no significant performance difference between using the cuDLA API and using the TRT Runtime, since the TRT Runtime uses cuDLA under the hood. This is true for both standalone and hybrid scenarios.
Please check out the DLA GitHub page for samples and resources, or to report issues: NVIDIA/Deep-Learning-Accelerator-SW (NVIDIA DLA-SW, the recipes and tools for running deep learning workloads on NVIDIA DLA cores for inference applications).
Also here are the cuDLA samples for reference:
NVIDIA/cuda-samples on GitHub: Samples/4_CUDA_Libraries/cuDLAHybridMode (master branch)
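
Since both workflows start from the TRT Builder, here is a minimal sketch of how the build step could be scripted around `trtexec`. The helper name `dla_trtexec_args` and the file names are hypothetical and only for illustration; the flags used (`--onnx`, `--useDLACore`, `--fp16`/`--int8`, `--allowGPUFallback`, `--saveEngine`) are standard `trtexec` options. The sketch assumes a device with two DLA cores (0 and 1), as on Jetson AGX Orin, and it produces a regular TensorRT engine for use with the TRT Runtime. Building a standalone loadable for direct cuDLA use needs additional build options that are not shown here.

```python
import shlex
from typing import List


def dla_trtexec_args(onnx_path: str, engine_path: str, dla_core: int = 0,
                     precision: str = "fp16",
                     gpu_fallback: bool = True) -> List[str]:
    """Assemble a trtexec command that builds an engine targeting one DLA core.

    Hypothetical helper: it only builds the argument list, it does not run
    trtexec or touch TensorRT.
    """
    if precision not in ("fp16", "int8"):
        # DLA executes layers in FP16 or INT8 only.
        raise ValueError("DLA supports only 'fp16' or 'int8' precision")
    if dla_core not in (0, 1):
        # Assumption: two DLA cores (0 and 1), as on Jetson AGX Orin.
        raise ValueError("dla_core must be 0 or 1")

    args = [
        "trtexec",
        f"--onnx={onnx_path}",        # input network
        f"--useDLACore={dla_core}",   # place supported layers on this DLA core
        f"--{precision}",             # enable the reduced precision DLA requires
        f"--saveEngine={engine_path}",
    ]
    if gpu_fallback:
        # Let layers that DLA cannot run fall back to the GPU.
        args.append("--allowGPUFallback")
    return args


if __name__ == "__main__":
    # Print the command so it can be pasted into a shell on the Jetson.
    print(shlex.join(dla_trtexec_args("model.onnx", "model_dla0.engine")))
```

Because the TRT Runtime uses cuDLA underneath, an engine built this way should already give performance comparable to a hand-written cuDLA pipeline; running the same `trtexec` with `--loadEngine` instead of `--onnx` is one way to measure it on your own model.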
