When the GPU and DLA are used at the same time, the latency of each increases

Please provide the following info (tick the boxes after creating this topic):
Software Version
DRIVE OS 6.0.5
DRIVE OS 6.0.4 (rev. 1)
[*] DRIVE OS 6.0.4 SDK
other

Target Operating System
[*] Linux
QNX
other

Hardware Platform
DRIVE AGX Orin Developer Kit (940-63710-0010-D00)
DRIVE AGX Orin Developer Kit (940-63710-0010-C00)
DRIVE AGX Orin Developer Kit (not sure its number)
[*] other

SDK Manager Version
1.9.1.10844
[*] other

Host Machine Version
native Ubuntu Linux 20.04 Host installed with SDK Manager
native Ubuntu Linux 20.04 Host installed with DRIVE OS Docker Containers
native Ubuntu Linux 18.04 Host installed with DRIVE OS Docker Containers
[*] other

Running the int8 model 0 end to end on GPU0 alone takes 45 ms.
Running the int8 model 1 end to end on DLA0 alone takes 17 ms (no fallback to GPU0 needed).
Running the int8 model 2 end to end on DLA1 alone takes 17 ms (no fallback to GPU0 needed).

I did a couple of experiments:

  1. Model 2 alone: DLA0 “latencyMs”: 17.6779 ms, DLA1 “latencyMs”: 18.1355 ms.
  2. Two DLAs and the GPU run model 0, model 1, and model 2 at the same time: DLA0 “latencyMs”: 21.8 ms, DLA1 “latencyMs”: 21.8 ms, GPU0 “latencyMs”: 59 ms.
  3. One DLA and the GPU run model 0 and model 1 at the same time: DLA0 “latencyMs”: 20.78 ms, GPU0 “latencyMs”: 51.05 ms.
  4. A single DLA runs model 1 and model 2 at the same time: DLA0 “latencyMs”: 33 ms.

Experiment 2: Why do the GPU0 and DLA latencies affect each other, and how can we avoid this?
Experiment 3: Why do the GPU0 and DLA latencies drop (compared with experiment 2) when only one DLA is used alongside the GPU?
Experiment 4: Why does a single DLA show no parallel processing ability at all (running n models makes the time n times longer), and how can one DLA process multiple models at the same time?

Does the DLA share memory bandwidth with the GPU, or are any other resources shared? As it stands, Orin cannot fully use the GPU's 167 INT8 TOPS and the DLAs' 87 INT8 TOPS. How do we run all the cores in parallel without their latencies affecting each other?

Dear @haihua.wei,
Did you use the trtexec tool for this experiment? Just want to double-check that you have made sure GPU fallback is not happening.
Is it possible to share repro steps/models/code?

It’s trtexec; the commands we ran are below.
trtexec --loadEngine=./res0220_1_int8.trt --iterations=100 --exportTimes=./bev_time.json --exportProfile=./bev_profile.json --separateProfileRun &

trtexec --loadEngine=./joint_model_cygnus_simple_batch1_512960_230220_dla_opt_sub_dla.int8 --iterations=1000 --useDLACore=0 --exportTimes=./cygnus_dla1_time.json --separateProfileRun &

trtexec --loadEngine=./subgraph_pointpillars_input_int8.trt --iterations=1000 --useDLACore=1 --exportTimes=./cygnus_dla1_time.json --separateProfileRun &
@SivaRamaKrishnaNV

We can confirm that the models run entirely on DLA: the --allowGPUFallback option was not enabled when building them, and the runtime traces captured with nsys profile show that everything runs on DLA. @SivaRamaKrishnaNV

Dear @haihua.wei,
A single DLA runs model 1 and model 2 at the same time: DLA0 “latencyMs”: 33 ms

In this case, are you running both models on DLA0 in parallel instances of trtexec?

Yes, and we also have a requirement for a single DLA0 to run multiple models.
@SivaRamaKrishnaNV

@SivaRamaKrishnaNV
Can you suggest any tools for analyzing memory bandwidth usage while the models are running?

Dear @haihua.wei,
The iGPU and the DLA share the same scheduler and memory resources. Although the tasks execute on separate hardware (i.e., the iGPU and the DLA), the DLA needs a separate GPU context to register its task-finish signal. The context switches between the two GPU contexts add latency to the pipeline, but the overall execution time can still be lower when running in parallel.
Using different trtexec processes creates multiple GPU contexts. You can try launching the models from a single process so that multiple GPU contexts are avoided. Even with a single context, some delay is expected.
Currently, the DLA cannot run models in parallel. We are working on making the GPU+DLA scenario more efficient. Using the cuDLA library directly is also an option to schedule work on the DLA optimally, but cuDLA is not part of the DevZone release; you need to contact your NVIDIA representative if you need access.
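
For illustration only, a minimal single-process sketch using TensorRT's C++ runtime API (not taken from this thread): it deserializes the three engines from the trtexec commands above in one process, pins the two DLA engines to cores 0 and 1 with IRuntime::setDLACore, and enqueues them on separate CUDA streams so only one GPU context is created. The Logger class and the readBlob/loadEngine/allocBindings helpers are illustrative names; buffer sizing assumes static shapes and at most 4 bytes per element, and error handling is omitted.

#include <NvInfer.h>
#include <cuda_runtime_api.h>
#include <fstream>
#include <iostream>
#include <vector>

using namespace nvinfer1;

// Minimal logger required by the TensorRT runtime.
class Logger : public ILogger {
    void log(Severity severity, const char* msg) noexcept override {
        if (severity <= Severity::kWARNING) std::cout << msg << std::endl;
    }
} gLogger;

// Read a serialized engine file into memory.
static std::vector<char> readBlob(const char* path) {
    std::ifstream f(path, std::ios::binary | std::ios::ate);
    std::vector<char> blob(static_cast<size_t>(f.tellg()));
    f.seekg(0);
    f.read(blob.data(), blob.size());
    return blob;
}

// Deserialize one engine; dlaCore < 0 means "run on the GPU".
static ICudaEngine* loadEngine(IRuntime* runtime, const char* path, int dlaCore) {
    if (dlaCore >= 0) runtime->setDLACore(dlaCore);
    auto blob = readBlob(path);
    return runtime->deserializeCudaEngine(blob.data(), blob.size());
}

// Allocate one device buffer per binding (assumes static shapes, <= 4 B/element).
static std::vector<void*> allocBindings(ICudaEngine* engine) {
    std::vector<void*> buf(engine->getNbBindings(), nullptr);
    for (int b = 0; b < engine->getNbBindings(); ++b) {
        Dims d = engine->getBindingDimensions(b);
        size_t vol = 1;
        for (int i = 0; i < d.nbDims; ++i) vol *= static_cast<size_t>(d.d[i]);
        cudaMalloc(&buf[b], vol * 4);
    }
    return buf;
}

int main() {
    IRuntime* runtime = createInferRuntime(gLogger);

    // One GPU engine and two DLA engines, all owned by this single process.
    ICudaEngine* engines[3] = {
        loadEngine(runtime, "./res0220_1_int8.trt", -1),
        loadEngine(runtime,
            "./joint_model_cygnus_simple_batch1_512960_230220_dla_opt_sub_dla.int8", 0),
        loadEngine(runtime, "./subgraph_pointpillars_input_int8.trt", 1)};

    IExecutionContext* ctx[3];
    std::vector<void*> bindings[3];
    cudaStream_t stream[3];
    for (int i = 0; i < 3; ++i) {
        ctx[i] = engines[i]->createExecutionContext();
        bindings[i] = allocBindings(engines[i]);
        cudaStreamCreate(&stream[i]);
    }

    // Enqueue all three networks; they run concurrently on GPU0, DLA0 and DLA1
    // under a single GPU context.
    for (int i = 0; i < 3; ++i) ctx[i]->enqueueV2(bindings[i].data(), stream[i], nullptr);
    for (int i = 0; i < 3; ++i) cudaStreamSynchronize(stream[i]);
    return 0;
}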

If we use cuDLA and NvSciBuf, can we avoid this iGPU/DLA context-switching problem?
@SivaRamaKrishnaNV

We don’t have a tool for this.

Yes, using cuDLA you can avoid some of the overheads.
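
As a rough hybrid-mode cuDLA sketch, assuming the cuDLA headers and library are available for your release (as noted above, access may have to go through your NVIDIA representative): the loadable file name and buffer sizes below are placeholders, and the cudlaTask fields should be checked against the cudla.h shipped with your release. In hybrid mode the DLA task is submitted on a CUDA stream inside the application's own context, which is how some of the context-related overhead can be avoided.

#include <cudla.h>
#include <cuda_runtime_api.h>
#include <fstream>
#include <vector>

int main() {
    // Attach to DLA core 0 in hybrid (CUDA-integrated) mode, so DLA work is
    // ordered on a CUDA stream inside this process's context.
    cudlaDevHandle dev;
    cudlaCreateDevice(0, &dev, CUDLA_CUDA_GPU);

    // Load a precompiled DLA loadable ("dla_loadable.bin" is a placeholder name).
    std::ifstream f("dla_loadable.bin", std::ios::binary | std::ios::ate);
    std::vector<uint8_t> blob(static_cast<size_t>(f.tellg()));
    f.seekg(0);
    f.read(reinterpret_cast<char*>(blob.data()), blob.size());
    cudlaModule module;
    cudlaModuleLoadFromMemory(dev, blob.data(), blob.size(), &module, 0);

    // Allocate CUDA buffers and register them with the DLA. The sizes are
    // placeholders; real code should query them via cudlaModuleGetAttributes.
    const size_t inputBytes = 1 << 20, outputBytes = 1 << 20;
    void *inGpu = nullptr, *outGpu = nullptr;
    cudaMalloc(&inGpu, inputBytes);
    cudaMalloc(&outGpu, outputBytes);
    uint64_t *inDla = nullptr, *outDla = nullptr;
    cudlaMemRegister(dev, reinterpret_cast<uint64_t*>(inGpu), inputBytes, &inDla, 0);
    cudlaMemRegister(dev, reinterpret_cast<uint64_t*>(outGpu), outputBytes, &outDla, 0);

    // Submit the DLA task on a CUDA stream and wait for it like any other
    // stream operation.
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudlaTask task = {};
    task.moduleHandle = module;
    task.inputTensor = &inDla;
    task.numInputTensors = 1;
    task.outputTensor = &outDla;
    task.numOutputTensors = 1;
    cudlaSubmitTask(dev, &task, 1, stream, 0);
    cudaStreamSynchronize(stream);

    cudlaModuleUnload(module, 0);
    cudlaDestroyDevice(dev);
    return 0;
}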

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.