Low performance while running a model on DLA0, DLA1, and GPU at the same time on Jetson AGX Orin 64 GB

Hello,
We want to run a model on DLA0, DLA1, and the GPU simultaneously on an NVIDIA Jetson AGX Orin 64GB. We are following the thread below for the implementation.
Link: DLA and GPU cores at the same time - #17 by angless

When we run inference on DLA0, DLA1, and the GPU at the same time, we observe very low throughput even though utilization is at maximum.
Can someone suggest what could be causing the drop in throughput?
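
For context, this is roughly how we build and launch the three engines (a sketch based on the linked thread, using trtexec; model.onnx and the engine file names are placeholders):

# Build one engine per accelerator
$ trtexec --onnx=model.onnx --fp16 --useDLACore=0 --allowGPUFallback --saveEngine=model_dla0.engine
$ trtexec --onnx=model.onnx --fp16 --useDLACore=1 --allowGPUFallback --saveEngine=model_dla1.engine
$ trtexec --onnx=model.onnx --fp16 --saveEngine=model_gpu.engine

# Run all three engines concurrently in separate processes
$ trtexec --loadEngine=model_dla0.engine --useDLACore=0 &
$ trtexec --loadEngine=model_dla1.engine --useDLACore=1 &
$ trtexec --loadEngine=model_gpu.engine &
$ wait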

Hi,

Have you maximized the device performance first?

$ sudo nvpmodel -m 0
$ sudo jetson_clocks
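
To double-check, you can query the current power mode and monitor the clocks and utilization:

$ sudo nvpmodel -q
$ sudo tegrastats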

Thanks.

Yes. The device is already running at maximum performance.

Hi,

Could you share the benchmark data with us as well?
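If you benchmark with trtexec, the throughput summary it prints at the end would be helpful, for example (engine name and DLA core as in your setup):

$ trtexec --loadEngine=model_dla0.engine --useDLACore=0 --duration=60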
Thanks.

Sure. I have attached a screenshot showing the FPS and the Jetson Power GUI information.

Hi,

Have you tried running inference in INT8 mode?
If not, would you mind giving it a try?
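
For a quick throughput test, an INT8 DLA engine can be built with trtexec like below (a sketch; without a calibration cache, trtexec uses dummy scales, which is fine for measuring performance but not for accuracy):

$ trtexec --onnx=model.onnx --int8 --useDLACore=0 --allowGPUFallback --saveEngine=model_dla0_int8.engine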

Also, could you try enlarging the CUDA work queue size to see if it helps?
https://docs.nvidia.com/deploy/mps/index.html#topic_5_2_4

export CUDA_DEVICE_MAX_CONNECTIONS=32

Thanks.


Also check out the DLA GitHub page for samples and resources, or to report issues: Recipes and tools for running deep learning workloads on NVIDIA DLA cores for inference applications.

We have a FAQ page that addresses some common questions that we see developers run into: Deep-Learning-Accelerator-SW/FAQ