DLA performance is not as expected

Device: Jetson AGX Orin 64GB
Environment:
L4T 35.4.1
Jetpack 5.1.2
cudla 3.12.1

I converted my model into a DLA loadable (standalone mode) and deployed it with cuDLA hybrid mode, using int8:hwc4 input and fp16:dla_linear output.
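For reference, the loadable was built with a trtexec command along these lines (the ONNX path, output path, and DLA core index below are placeholders, not my exact command):

$ trtexec --onnx=model.onnx \
          --useDLACore=0 \
          --buildDLAStandalone \
          --int8 --fp16 \
          --inputIOFormats=int8:dla_hwc4 \
          --outputIOFormats=fp16:dla_linear \
          --saveEngine=model.dla.loadable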

When I run the model on the DLA by itself, the inference time is about 35 ms. Compared with 20 ms on the GPU, that is in line with my expectations.

However, when I run the full application, with other models running on the GPU at the same time, the DLA model's inference time increases to about 50 ms. This is contrary to my expectation: I assumed a model running on the DLA would not be affected by the GPU. Notably, the application's GPU usage peaks at over 95%. I also expected that moving this model to the DLA would offload work from the GPU and speed up the GPU-side computation, but the effect was minimal.

I would like to understand how the DLA runtime and the GPU can affect each other in this way, and how to optimize the DLA path so that it genuinely reduces GPU load.

Thanks for your reply!

Hi,

Could you try setting the environment variable below to see if it helps?

$ export CUDA_DEVICE_MAX_CONNECTIONS=32

More details can be found in this link.
Thanks.

Thanks for your reply!

I tried setting the environment variable to both 16 and 32. The DLA model's inference time actually increased, and the total runtime of my full application increased as well.

My goal is to reduce the total runtime of my application by offloading one model to the DLA. The result is clearly the opposite of what I expected. Is there anything else I need to pay attention to when deploying on the DLA?

To add some detail, the DLA model is deployed with cuDLA's hybrid mode. It runs in its own dedicated stream, and a single loop iteration in that stream consists of three CUDA kernels plus the DLA task: transforming the input data into a GPU buffer registered with the DLA, executing the DLA task, copying the data from the DLA-registered output buffer to a regular GPU buffer, and parsing the output data.
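Roughly, each iteration is structured like the sketch below. The kernel names and bodies, buffer handles, and sizes are placeholders (the real kernels do more work), error checking is omitted, and inRegPtr/outRegPtr are assumed to be the pointers returned by cudlaMemRegister for the cudaMalloc'ed buffers at initialization time.

#include <cuda_runtime.h>
#include <cuda_fp16.h>
#include <cudla.h>

// Placeholder kernels standing in for the three GPU steps described above.
__global__ void toHwc4Int8(const float* src, int8_t* dst, int n)
{   // real code would also quantize/scale and reorder to HWC4
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] = static_cast<int8_t>(src[i]);
}
__global__ void copyDlaOutput(const __half* src, __half* dst, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] = src[i];
}
__global__ void parseOutput(const __half* src, float* dst, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] = __half2float(src[i]);
}

// One loop iteration on the dedicated DLA stream.
// devHandle/moduleHandle come from cudlaCreateDevice(..., CUDLA_CUDA_DLA) and
// cudlaModuleLoadFromMemory; inRegPtr/outRegPtr are the registered pointers
// returned by cudlaMemRegister for the cudaMalloc'ed buffers dlaIn/dlaOut.
void runDlaIteration(cudlaDevHandle devHandle, cudlaModule moduleHandle,
                     uint64_t* inRegPtr, uint64_t* outRegPtr,
                     const float* gpuIn, int8_t* dlaIn,
                     const __half* dlaOut, __half* gpuTmp, float* gpuOut,
                     int inCount, int outCount, cudaStream_t stream)
{
    const int block = 256;

    // Kernel 1: transform the input into the int8:hwc4 buffer registered with the DLA.
    toHwc4Int8<<<(inCount + block - 1) / block, block, 0, stream>>>(gpuIn, dlaIn, inCount);

    // DLA task: enqueued on the same CUDA stream (hybrid mode).
    cudlaTask task = {};
    task.moduleHandle     = moduleHandle;
    task.inputTensor      = &inRegPtr;
    task.numInputTensors  = 1;
    task.outputTensor     = &outRegPtr;
    task.numOutputTensors = 1;
    task.waitEvents       = nullptr;
    task.signalEvents     = nullptr;
    cudlaSubmitTask(devHandle, &task, 1, stream, 0);

    // Kernel 2: move the fp16:dla_linear output from the DLA-registered buffer to a GPU buffer.
    copyDlaOutput<<<(outCount + block - 1) / block, block, 0, stream>>>(dlaOut, gpuTmp, outCount);

    // Kernel 3: parse/decode the output.
    parseOutput<<<(outCount + block - 1) / block, block, 0, stream>>>(gpuTmp, gpuOut, outCount);
}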

Hi,

Have you maximized the device’s performance?

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

If you still see the performance issue after boosting the Orin, please help to provide a reproducible source so we can give it a check.
Thanks.