Hello
We want to run a model on DLA0, DLA1, and the GPU of an NVIDIA Jetson AGX Orin 64GB at the same time. We are following the thread below for the implementation.
Link: DLA and GPU cores at the same time - #17 by angless
When we run inference on DLA0, DLA1, and the GPU simultaneously, we observe very low throughput even though utilization is at its maximum.
Can someone suggest what could be the reason for the drop in throughput?
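For reference, this is roughly how we launch the three inference processes (a minimal sketch using trtexec from TensorRT; model.onnx and the engine names are placeholders):
$ # Build one engine per accelerator
$ trtexec --onnx=model.onnx --fp16 --saveEngine=model_gpu.engine
$ trtexec --onnx=model.onnx --fp16 --useDLACore=0 --allowGPUFallback --saveEngine=model_dla0.engine
$ trtexec --onnx=model.onnx --fp16 --useDLACore=1 --allowGPUFallback --saveEngine=model_dla1.engine
$ # Run all three engines concurrently and collect the reported throughput
$ trtexec --loadEngine=model_gpu.engine &
$ trtexec --loadEngine=model_dla0.engine --useDLACore=0 &
$ trtexec --loadEngine=model_dla1.engine --useDLACore=1 &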
Hi,
Have you maximized the device performance first?
$ sudo nvpmodel -m 0
$ sudo jetson_clocks
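To confirm the settings took effect, you can check the active power mode and clock activity with the standard Jetson tools:
$ sudo nvpmodel -q
$ sudo tegrastats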
Thanks.
Yes. The device is already running at maximum performance.
Hi,
Could you share the benchmark data with us as well?
Thanks.
Sure. I have attached a screenshot that includes the FPS and the Jetson Power GUI information.
Hi,
Have you tried to infer the model with INT8 mode?
If not, would you mind giving it a try?
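For a quick INT8 throughput test, the engine can be rebuilt with trtexec (a sketch; model.onnx is a placeholder, and without a calibration cache trtexec uses dummy scales, so this is suitable for performance measurement only, not accuracy):
$ trtexec --onnx=model.onnx --int8 --useDLACore=0 --allowGPUFallback --saveEngine=model_dla0_int8.engine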
Also, could you try enlarging the CUDA wait queue size to see if it helps?
https://docs.nvidia.com/deploy/mps/index.html#topic_5_2_4
export CUDA_DEVICE_MAX_CONNECTIONS=32
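For example, export the variable before launching the concurrent processes, so that each inference process inherits it (engine names are placeholders, as in the sketch above):
$ export CUDA_DEVICE_MAX_CONNECTIONS=32
$ trtexec --loadEngine=model_gpu.engine &
$ trtexec --loadEngine=model_dla0.engine --useDLACore=0 &
$ trtexec --loadEngine=model_dla1.engine --useDLACore=1 &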
Thanks.
system, November 29, 2022:
This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
ramc, February 14, 2023:
Also check out the DLA GitHub page for samples and resources, or to report issues: Recipes and tools for running deep learning workloads on NVIDIA DLA cores for inference applications.
We have a FAQ page that addresses some common questions that we see developers run into: Deep-Learning-Accelerator-SW/FAQ