Low performance while running a model on DLA0, DLA1, and GPU at the same time on Jetson AGX Orin 64GB

We want to run a model on DLA0, DLA1, and the GPU concurrently on an NVIDIA Jetson AGX Orin 64GB. We are following the thread below for the implementation.
Link: DLA and GPU cores at the same time - #17 by angless

When we run inference on DLA0, DLA1, and the GPU together, we observe very low throughput even though utilization is at its maximum.
Can someone suggest what could be causing the drop in throughput?


Have you maximized the device performance first?

$ sudo nvpmodel -m 0
$ sudo jetson_clocks


Yes. The device is already set to maximum performance.


Could you share the benchmark data with us as well?

Sure. I have attached a screenshot that includes the FPS and the Jetson Power GUI information.


Have you tried running inference on the model in INT8 mode?
If not, would you mind giving it a try?
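
As a rough sketch of one way to try INT8, the script below prints a trtexec command line for the GPU and for each DLA core. The model file name (model.onnx), the engine file names, and the build_cmd helper are placeholders made up for illustration, not something from this thread. Without a calibration cache, trtexec --int8 is only meaningful for measuring speed, not accuracy.

```shell
#!/bin/sh
# Hypothetical model path; replace with your own ONNX file.
MODEL="model.onnx"

# Build the trtexec command line for one target.
# $1 is "gpu" or a DLA core index (0 or 1).
build_cmd() {
    target="$1"
    if [ "$target" = "gpu" ]; then
        # GPU-only INT8 engine
        echo "trtexec --onnx=$MODEL --int8 --saveEngine=gpu_int8.engine"
    else
        # INT8 engine on the given DLA core; unsupported layers fall back to the GPU
        echo "trtexec --onnx=$MODEL --int8 --useDLACore=$target --allowGPUFallback --saveEngine=dla${target}_int8.engine"
    fi
}

# Dry run: print the three commands without executing them.
for t in gpu 0 1; do
    build_cmd "$t"
done
```

To actually build and time an engine on the Jetson, a printed line can be run directly, for example with build_cmd 0 | sh. Running each target on its own first gives a baseline to compare against the combined run.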

Also, could you try enlarging the CUDA wait queue size to see if it helps?
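
The exact setting is not spelled out here. One plausible reading, and this is an assumption, is the CUDA_DEVICE_MAX_CONNECTIONS environment variable, which controls the number of host-to-device work queues (default 8, maximum 32). A minimal sketch, to be set in the same shell before launching the inference process:

```shell
# Assumption: "CUDA wait queue size" refers to CUDA_DEVICE_MAX_CONNECTIONS.
# Valid values are 1 to 32; the default is 8.
export CUDA_DEVICE_MAX_CONNECTIONS=32

# Confirm the value that child processes will inherit.
echo "CUDA_DEVICE_MAX_CONNECTIONS=$CUDA_DEVICE_MAX_CONNECTIONS"
```

The variable is read when the CUDA context is created, so it needs to be exported before the application starts rather than changed while it is running.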



Also check out the DLA GitHub page for samples and resources, or to report issues: Recipes and tools for running deep learning workloads on NVIDIA DLA cores for inference applications.

We have a FAQ page that addresses some common questions that we see developers run into: Deep-Learning-Accelerator-SW/FAQ