Any performance benefits in using cuDLA directly instead of TensorRT?

Hi,

I am currently working with TensorRT with DLA on a Jetson Orin Dev Kit, and I have a few questions about which workflow to use. From what I understand, there are two possible workflows for running inference on the DLA (the build step they share is sketched after this list):

  1. Compilation with the TRT Builder, runtime with the TRT Runtime
  2. Compilation with the TRT Builder, runtime with the cuDLA API
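For reference, this is roughly how I am building the engine today. It is only a minimal sketch assuming an ONNX model called model.onnx and the TensorRT 8.4 C++ API; the file names, logger, and output path are placeholders:

```cpp
// Minimal build-step sketch (placeholder code, not from any sample): compiling an
// ONNX model for DLA with the TensorRT 8.4 C++ builder API.
#include <NvInfer.h>
#include <NvOnnxParser.h>
#include <fstream>
#include <iostream>
#include <memory>

using namespace nvinfer1;

class Logger : public ILogger {
    void log(Severity severity, const char* msg) noexcept override {
        if (severity <= Severity::kWARNING) std::cout << msg << std::endl;
    }
};

int main() {
    Logger logger;
    auto builder = std::unique_ptr<IBuilder>(createInferBuilder(logger));
    auto network = std::unique_ptr<INetworkDefinition>(builder->createNetworkV2(
        1U << static_cast<uint32_t>(NetworkDefinitionCreationFlag::kEXPLICIT_BATCH)));
    auto parser = std::unique_ptr<nvonnxparser::IParser>(
        nvonnxparser::createParser(*network, logger));
    parser->parseFromFile("model.onnx", static_cast<int>(ILogger::Severity::kWARNING));

    auto config = std::unique_ptr<IBuilderConfig>(builder->createBuilderConfig());
    config->setFlag(BuilderFlag::kFP16);             // DLA needs FP16 or INT8 precision
    config->setDefaultDeviceType(DeviceType::kDLA);  // place layers on the DLA by default
    config->setDLACore(0);                           // Orin exposes DLA cores 0 and 1
    config->setFlag(BuilderFlag::kGPU_FALLBACK);     // let unsupported layers fall back to GPU
    // For workflow 2, I believe the config would additionally need
    // EngineCapability::kDLA_STANDALONE to produce a standalone loadable for cuDLA,
    // but I have not verified this.

    // Workflow 1: serialize a TRT engine that embeds the DLA loadable.
    auto plan = std::unique_ptr<IHostMemory>(
        builder->buildSerializedNetwork(*network, *config));
    std::ofstream("model_dla.engine", std::ios::binary)
        .write(static_cast<const char*>(plan->data()), plan->size());
    return 0;
}
```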

Here are my questions:

  • Is there another way to create DLA loadables without the TensorRT builder?
  • The cuDLA API exposes mechanisms to manage devices and memory and to submit DLA tasks (see the sketch after this list). In terms of performance, is there a big difference between the strategy TensorRT offers and one we could build ourselves with cuDLA?
  • Same question for hybrid and standalone DLA inference.
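To make the cuDLA question concrete, this is the kind of flow I have in mind. It is a rough sketch of the hybrid-mode path based on the cuDLAHybridMode sample; the loadable file name and tensor sizes are placeholders and all error checks are omitted. My understanding is that standalone mode would swap the CUDA buffers and stream for NvSciBuf/NvSciSync objects:

```cpp
// Rough sketch of the cuDLA hybrid-mode runtime path (workflow 2).
// Assumes a DLA loadable has already been built and saved to "model.loadable"
// (placeholder name); buffer sizes are placeholders, error handling omitted.
#include <cudla.h>
#include <cuda_runtime.h>
#include <cstdint>
#include <fstream>
#include <iterator>
#include <vector>

int main() {
    // Read the pre-built DLA loadable from disk.
    std::ifstream file("model.loadable", std::ios::binary);
    std::vector<uint8_t> loadable((std::istreambuf_iterator<char>(file)),
                                  std::istreambuf_iterator<char>());

    // Device management: open DLA core 0 in hybrid (CUDA) mode.
    cudlaDevHandle devHandle;
    cudlaCreateDevice(0, &devHandle, CUDLA_CUDA_GPU);

    // Load the compiled DLA loadable as a module.
    cudlaModule moduleHandle;
    cudlaModuleLoadFromMemory(devHandle, loadable.data(), loadable.size(),
                              &moduleHandle, 0);

    // Memory management: allocate CUDA buffers and register them with cuDLA.
    const size_t inSize  = 1 * 3 * 224 * 224 * sizeof(uint16_t);  // placeholder sizes
    const size_t outSize = 1 * 1000 * sizeof(uint16_t);
    void *inGpu = nullptr, *outGpu = nullptr;
    cudaMalloc(&inGpu, inSize);
    cudaMalloc(&outGpu, outSize);
    uint64_t *inReg = nullptr, *outReg = nullptr;
    cudlaMemRegister(devHandle, reinterpret_cast<uint64_t*>(inGpu), inSize, &inReg, 0);
    cudlaMemRegister(devHandle, reinterpret_cast<uint64_t*>(outGpu), outSize, &outReg, 0);

    // Task submission: enqueue one DLA task on a CUDA stream
    // (input data would normally be copied onto the stream before this).
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudlaTask task = {};
    task.moduleHandle = moduleHandle;
    task.numInputTensors = 1;
    task.inputTensor = &inReg;
    task.numOutputTensors = 1;
    task.outputTensor = &outReg;
    cudlaSubmitTask(devHandle, &task, 1, stream, 0);
    cudaStreamSynchronize(stream);

    // Cleanup.
    cudlaMemUnregister(devHandle, inReg);
    cudlaMemUnregister(devHandle, outReg);
    cudlaModuleUnload(moduleHandle, 0);
    cudlaDestroyDevice(devHandle);
    cudaFree(inGpu);
    cudaFree(outGpu);
    cudaStreamDestroy(stream);
    return 0;
}
```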

Thank you very much for your help,

Environment

TensorRT Version: 8.4.1
GPU Type: Embedded Jetson Orin
CUDA Version: 11.4

Hi,

Please check the below links, as they might answer your concerns.

For further assistance, we are moving this post to the Jetson Orin forum to get better help.

Thanks!

The only way to create DLA loadables is by using the TRT Builder.
No, at the moment there is no significant performance benefit to using the cuDLA API over the TRT Runtime, since the TRT Runtime uses cuDLA under the hood. This is true for both the standalone and hybrid scenarios (a rough TRT Runtime sketch is included after the links below for comparison).
Please check out the DLA GitHub page for samples and resources, or to report issues: https://github.com/NVIDIA/Deep-Learning-Accelerator-SW (NVIDIA DLA-SW, the recipes and tools for running deep learning workloads on NVIDIA DLA cores for inference applications).
Also here are the cuDLA samples for reference:
https://github.com/NVIDIA/cuda-samples/tree/master/Samples/4_CUDA_Libraries/cuDLAStandaloneMode
https://github.com/NVIDIA/cuda-samples/tree/master/Samples/4_CUDA_Libraries/cuDLAHybridMode
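For comparison, the TRT Runtime side of workflow 1 looks roughly like the sketch below. It assumes the TensorRT 8.4 C++ API and a previously built DLA engine file (the file name and buffer sizes are placeholders, error handling omitted); internally this path goes through cuDLA as well:

```cpp
// Rough sketch of the TRT Runtime path (workflow 1), assuming a DLA engine was
// saved as "model_dla.engine" (placeholder name); sizes are placeholders.
#include <NvInfer.h>
#include <cuda_runtime.h>
#include <fstream>
#include <iostream>
#include <iterator>
#include <memory>
#include <vector>

using namespace nvinfer1;

class Logger : public ILogger {
    void log(Severity severity, const char* msg) noexcept override {
        if (severity <= Severity::kWARNING) std::cout << msg << std::endl;
    }
};

int main() {
    Logger logger;
    std::ifstream file("model_dla.engine", std::ios::binary);
    std::vector<char> plan((std::istreambuf_iterator<char>(file)),
                           std::istreambuf_iterator<char>());

    auto runtime = std::unique_ptr<IRuntime>(createInferRuntime(logger));
    runtime->setDLACore(0);  // run on DLA core 0
    auto engine = std::unique_ptr<ICudaEngine>(
        runtime->deserializeCudaEngine(plan.data(), plan.size()));
    auto context = std::unique_ptr<IExecutionContext>(engine->createExecutionContext());

    // Placeholder device buffers for one input and one output binding.
    void* bindings[2] = {nullptr, nullptr};
    cudaMalloc(&bindings[0], 1 * 3 * 224 * 224 * sizeof(float));
    cudaMalloc(&bindings[1], 1 * 1000 * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    // TRT schedules the DLA work through cuDLA internally.
    context->enqueueV2(bindings, stream, nullptr);
    cudaStreamSynchronize(stream);

    cudaFree(bindings[0]);
    cudaFree(bindings[1]);
    cudaStreamDestroy(stream);
    return 0;
}
```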
