DLA performance is not as expected

Device: Jetson AGX Orin 64GB
Environment:
L4T 35.4.1
Jetpack 5.1.2
cudla 3.12.1

I converted my model to a DLA loadable using standalone mode and deployed it with cuDLA hybrid mode, with int8:hwc4 input and fp16:dla_linear output.
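For context, a trtexec invocation along these lines produces such a loadable. The model path, engine name, and omitted calibration options are placeholders, and the --buildDLAStandalone flag requires a sufficiently recent trtexec:

$ trtexec --onnx=model.onnx \
      --useDLACore=0 --int8 --fp16 \
      --buildDLAStandalone \
      --inputIOFormats=int8:dla_hwc4 \
      --outputIOFormats=fp16:dla_linear \
      --saveEngine=model.loadable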

When I run the model on the DLA by itself, the latency is about 35 ms, compared with about 20 ms on the GPU, which is in line with my expectations.

However, when I run the full application, with other models running on the GPU, the latency of the DLA model increases to 50 ms. This is contrary to my expectation; I thought a model running on the DLA should not be affected by the GPU. Notably, the application's GPU usage peaked at over 95%. In addition, I expected that using the DLA would offload work from the GPU and thus speed up the GPU computations, but the effect was minimal.

I want to understand the potential interactions between the DLA runtime and the GPU that cause these results, and how to optimize the DLA runtime so that it actually reduces the GPU load.

Thanks in advance!

Hi,

Could you try to set the below environment variable to see if it helps?

$ export CUDA_DEVICE_MAX_CONNECTIONS=32

More details can be found in this link.
Thanks.

Thanks for your reply!

I tried setting the environment variable to 16 and 32. The latency of the DLA model actually increased, and the total latency of my full application increased as well.

My goal is to reduce the total runtime of my application by offloading one model to the DLA. Since the result is the opposite of what I expected, is there anything else I need to pay attention to when deploying on the DLA?

To add to that, the DLA model uses cuDLA's hybrid mode for deployment. The DLA model runs in its own separate CUDA stream, and a single iteration on that stream includes three CUDA kernels plus the DLA task: transforming the input data into a GPU pointer registered with the DLA, executing the DLA task, copying the data from the DLA-registered output pointer to a GPU pointer, and parsing the output data.
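For reference, here is a minimal sketch of that per-iteration structure using the cuDLA hybrid-mode API. The loadable contents, buffer sizes, and the three kernels are placeholders, and error checking is omitted:

#include <cuda_runtime.h>
#include <cudla.h>
#include <cstdint>
#include <vector>

int main()
{
    // Hybrid mode: the DLA is driven through a CUDA context/stream.
    cudlaDevHandle dla = nullptr;
    cudlaCreateDevice(0, &dla, CUDLA_CUDA_DLA);

    // Load the standalone loadable (contents/size are placeholders;
    // in practice read model.loadable from disk).
    std::vector<uint8_t> loadable(1);
    cudlaModule module = nullptr;
    cudlaModuleLoadFromMemory(dla, loadable.data(), loadable.size(), &module, 0);

    // GPU buffers registered with the DLA, so both engines address
    // the same memory (sizes are placeholders).
    const size_t inputBytes = 1 << 20, outputBytes = 1 << 20;
    void *gpuIn = nullptr, *gpuOut = nullptr;
    cudaMalloc(&gpuIn, inputBytes);
    cudaMalloc(&gpuOut, outputBytes);
    uint64_t *dlaIn = nullptr, *dlaOut = nullptr;
    cudlaMemRegister(dla, reinterpret_cast<uint64_t *>(gpuIn), inputBytes, &dlaIn, 0);
    cudlaMemRegister(dla, reinterpret_cast<uint64_t *>(gpuOut), outputBytes, &dlaOut, 0);

    cudaStream_t stream;
    cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);

    cudlaTask task = {};
    task.moduleHandle     = module;
    task.numInputTensors  = 1;
    task.numOutputTensors = 1;
    task.inputTensor      = &dlaIn;
    task.outputTensor     = &dlaOut;

    // One iteration of the loop described above (kernels are placeholders):
    // reformatKernel<<<grid, block, 0, stream>>>(rawInput, gpuIn);     // input -> DLA-registered buffer
    cudlaSubmitTask(dla, &task, 1, stream, 0);                          // DLA inference on the same stream
    // copyKernel<<<grid, block, 0, stream>>>(gpuOut, downstreamBuf);   // DLA output -> GPU buffer
    // parseKernel<<<grid, block, 0, stream>>>(downstreamBuf, results); // post-processing
    cudaStreamSynchronize(stream);

    cudlaMemUnregister(dla, dlaIn);
    cudlaMemUnregister(dla, dlaOut);
    cudlaModuleUnload(module, 0);
    cudaStreamDestroy(stream);
    cudaFree(gpuIn);
    cudaFree(gpuOut);
    cudlaDestroyDevice(dla);
    return 0;
}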

Hi,

Have you maximized the device’s performance?

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

If you still see the performance issue after boosting the Orin, please help to provide a reproducible source so we can give it a check.
Thanks.

Thanks for your reply,

I have tried all the suggestions you provided earlier, but they didn't work. Due to confidentiality restrictions, I can't provide the source code.

I used the hybrid runtime code from GitHub - NVIDIA-AI-IOT/cuDLA-samples: YOLOv5 on Orin DLA to deploy and profile my DLA model. Here are the results.

07-25-2024 10:47:28 RAM 3636/30593MB (lfb 5031x4MB) SWAP 0/15296MB (cached 0MB) CPU [0%@729,0%@729,0%@729,0%@729,0%@729,0%@729,0%@729,0%@729,off,off,off,off] EMC_FREQ 0%@665 GR3D_FREQ 0%@[0,0] VIC_FREQ 115 APE 174 CV0@-256C CPU@47.968C Tboard@36C SOC2@44.656C Tdiode@37.25C SOC0@45.031C CV1@-256C GPU@-256C tj@47.968C SOC1@43.875C CV2@-256C VDD_GPU_SOC 1945mW/1945mW VDD_CPU_CV 388mW/388mW VIN_SYS_5V0 2214mW/2214mW NC 0mW/0mW VDDQ_VDD2_1V8AO 201mW/201mW NC 0mW/0mW
07-25-2024 10:47:29 RAM 3636/30593MB (lfb 5031x4MB) SWAP 0/15296MB (cached 0MB) CPU [0%@729,0%@729,0%@729,0%@729,0%@729,1%@729,0%@729,0%@729,off,off,off,off] EMC_FREQ 0%@204 GR3D_FREQ 0%@[0,0] VIC_FREQ 115 APE 174 CV0@-256C CPU@47.875C Tboard@36C SOC2@44.75C Tdiode@37.5C SOC0@45.25C CV1@-256C GPU@-256C tj@47.875C SOC1@44C CV2@-256C VDD_GPU_SOC 1945mW/1945mW VDD_CPU_CV 388mW/388mW VIN_SYS_5V0 2217mW/2215mW NC 0mW/0mW VDDQ_VDD2_1V8AO 201mW/201mW NC 0mW/0mW
07-25-2024 10:47:30 RAM 3702/30593MB (lfb 5031x4MB) SWAP 0/15296MB (cached 0MB) CPU [4%@1728,12%@1728,0%@1728,0%@1728,0%@729,0%@729,0%@729,0%@729,off,off,off,off] EMC_FREQ 0%@665 GR3D_FREQ 11%@[305,0] VIC_FREQ 115 NVDLA0_FREQ @1369 APE 174 CV0@46.468C CPU@48.25C Tboard@36C SOC2@44.781C Tdiode@37.5C SOC0@45.031C CV1@45.531C GPU@42.906C tj@48.062C SOC1@44C CV2@42.437C VDD_GPU_SOC 3110mW/2333mW VDD_CPU_CV 1166mW/647mW VIN_SYS_5V0 2923mW/2451mW NC 0mW/0mW VDDQ_VDD2_1V8AO 504mW/302mW NC 0mW/0mW
07-25-2024 10:47:31 RAM 3702/30593MB (lfb 5031x4MB) SWAP 0/15296MB (cached 0MB) CPU [4%@729,0%@729,0%@729,0%@729,0%@729,0%@729,0%@729,0%@729,off,off,off,off] EMC_FREQ 9%@2133 GR3D_FREQ 56%@[305,0] VIC_FREQ 115 NVDLA0_FREQ @1369 APE 174 CV0@46.593C CPU@48.156C Tboard@36C SOC2@44.906C Tdiode@37.5C SOC0@45.375C CV1@45.5C GPU@42.625C tj@48.156C SOC1@44.125C CV2@42.687C VDD_GPU_SOC 3884mW/2721mW VDD_CPU_CV 1553mW/873mW VIN_SYS_5V0 4947mW/3075mW NC 0mW/0mW VDDQ_VDD2_1V8AO 1716mW/655mW NC 0mW/0mW
07-25-2024 10:47:32 RAM 3702/30593MB (lfb 5031x4MB) SWAP 0/15296MB (cached 0MB) CPU [0%@729,0%@729,0%@729,0%@729,0%@729,0%@729,0%@729,0%@729,off,off,off,off] EMC_FREQ 14%@2133 GR3D_FREQ 55%@[305,0] VIC_FREQ 115 NVDLA0_FREQ @1369 APE 174 CV0@46.656C CPU@48.093C Tboard@36C SOC2@44.937C Tdiode@37.5C SOC0@45.187C CV1@45.75C GPU@42.843C tj@48.093C SOC1@44.312C CV2@42.468C VDD_GPU_SOC 3884mW/2953mW VDD_CPU_CV 1553mW/1009mW VIN_SYS_5V0 5249mW/3510mW NC 0mW/0mW VDDQ_VDD2_1V8AO 1918mW/908mW NC 0mW/0mW
07-25-2024 10:47:33 RAM 3702/30593MB (lfb 5031x4MB) SWAP 0/15296MB (cached 0MB) CPU [0%@729,1%@729,0%@729,0%@729,0%@729,0%@729,0%@729,0%@729,off,off,off,off] EMC_FREQ 17%@2133 GR3D_FREQ 55%@[305,0] VIC_FREQ 115 NVDLA0_FREQ @1369 APE 174 CV0@47.156C CPU@48.218C Tboard@36C SOC2@45.125C Tdiode@37.5C SOC0@45.281C CV1@45.593C GPU@42.875C tj@48.218C SOC1@44.062C CV2@42.687C VDD_GPU_SOC 3884mW/3108mW VDD_CPU_CV 1553mW/1100mW VIN_SYS_5V0 5249mW/3799mW NC 0mW/0mW VDDQ_VDD2_1V8AO 1918mW/1076mW NC 0mW/0mW
07-25-2024 10:47:34 RAM 3702/30593MB (lfb 5031x4MB) SWAP 0/15296MB (cached 0MB) CPU [2%@729,0%@729,0%@729,0%@729,0%@729,0%@729,0%@729,0%@729,off,off,off,off] EMC_FREQ 18%@2133 GR3D_FREQ 55%@[305,0] VIC_FREQ 115 NVDLA0_FREQ @1369 APE 174 CV0@46.625C CPU@48.281C Tboard@36C SOC2@45.062C Tdiode@37.75C SOC0@45.218C CV1@45.781C GPU@42.843C tj@48.187C SOC1@44.375C CV2@42.656C VDD_GPU_SOC 3884mW/3219mW VDD_CPU_CV 1553mW/1164mW VIN_SYS_5V0 5350mW/4021mW NC 0mW/0mW VDDQ_VDD2_1V8AO 1918mW/1196mW NC 0mW/0mW
07-25-2024 10:47:35 RAM 3637/30593MB (lfb 5031x4MB) SWAP 0/15296MB (cached 0MB) CPU [0%@729,0%@729,1%@729,0%@729,4%@729,4%@729,0%@729,0%@729,off,off,off,off] EMC_FREQ 11%@2133 GR3D_FREQ 0%@[0,0] VIC_FREQ 115 APE 174 CV0@-256C CPU@48.312C Tboard@36C SOC2@44.937C Tdiode@37.75C SOC0@45.218C CV1@-256C GPU@-256C tj@48.156C SOC1@44.156C CV2@-256C VDD_GPU_SOC 3110mW/3205mW VDD_CPU_CV 1165mW/1164mW VIN_SYS_5V0 3729mW/3984mW NC 0mW/0mW VDDQ_VDD2_1V8AO 604mW/1122mW NC 0mW/0mW

When cuDLA's cudlaSubmitTask interface is called, the GR3D_FREQ metric increases significantly. I believe this metric indicates GPU utilization. Why does submitting a DLA task cause GPU usage to rise? My model is a standalone DLA loadable and is expected to run entirely on the DLA. In comparison, when the cudlaSubmitTask call is commented out, GR3D_FREQ stays at 0.

Here is the Nsight Systems profiling result; I can't figure out where the GPU is being used.
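For reference, a capture command along these lines shows the CUDA API calls and kernel activity around cudlaSubmitTask; the trace selection, output name, and application name are placeholders:

$ nsys profile --trace=cuda,nvtx,osrt --output=dla_hybrid ./my_app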

I want to know whether this unexpected GPU usage is the reason behind the less-than-expected performance when the DLA model is integrated into my GPU-intensive application. Where does this GPU usage come from, and how can it be avoided?

Thanks.

There has been no update from you for a while, so we assume this is no longer an issue.
Hence, we are closing this topic. If you need further support, please open a new one.
Thanks

Hi,

Are you able to share the model instead of the reproducible source?
If you want to deploy the model solely on the DLA, could you also try the standalone runtime mode?
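For reference, a minimal sketch of opening the device in standalone mode is shown below; in this mode buffers and synchronization go through NvSciBuf/NvSciSync rather than CUDA, and error checking is omitted:

#include <cudla.h>

cudlaDevHandle dla = nullptr;
// CUDLA_STANDALONE keeps DLA task submission off the CUDA context entirely;
// memory is then imported via cudlaImportExternalMemory (NvSciBuf) and
// synchronization via cudlaImportExternalSemaphore (NvSciSync).
cudlaCreateDevice(0, &dla, CUDLA_STANDALONE);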

Thanks.