When the GPU and DLA are used at the same time, the latency of each increases

Please provide the following info (tick the boxes after creating this topic):
Software Version
DRIVE OS 6.0.5
DRIVE OS 6.0.4 (rev. 1)
[*] DRIVE OS 6.0.4 SDK
other

Target Operating System
[*] Linux
QNX
other

Hardware Platform
DRIVE AGX Orin Developer Kit (940-63710-0010-D00)
DRIVE AGX Orin Developer Kit (940-63710-0010-C00)
DRIVE AGX Orin Developer Kit (not sure its number)
[*] other

SDK Manager Version
1.9.1.10844
[*] other

Host Machine Version
native Ubuntu Linux 20.04 Host installed with SDK Manager
native Ubuntu Linux 20.04 Host installed with DRIVE OS Docker Containers
native Ubuntu Linux 18.04 Host installed with DRIVE OS Docker Containers
[*] other

Running the int8 model 0 end to end on GPU0 alone takes 45 ms.
Running the int8 model 1 end to end on DLA0 alone takes 17 ms (no fallback to GPU0 needed).
Running the int8 model 2 end to end on DLA1 alone takes 17 ms (no fallback to GPU0 needed).

I did a couple of experiments:

  1. Model 2 alone: DLA0 “latencyMs”: 17.6779 ms, DLA1 “latencyMs”: 18.1355 ms.
  2. Two DLAs and the GPU run model 0, model 1, and model 2 at the same time: DLA0 “latencyMs”: 21.8 ms, DLA1 “latencyMs”: 21.8 ms, GPU0 “latencyMs”: 59 ms.
  3. One DLA and the GPU run model 0 and model 1 at the same time: DLA0 “latencyMs”: 20.78 ms, GPU0 “latencyMs”: 51.05 ms.
  4. A single DLA runs model 1 and model 2 at the same time: DLA0 “latencyMs”: 33 ms.

Experiment 2: Why do the GPU0 and DLA latencies affect each other, and how can we avoid this?
Experiment 3: Why do the GPU0 and DLA latencies drop (compared with experiment 2) when only one DLA is used alongside the GPU?
Experiment 4: Why does a single DLA show no parallel processing ability at all (running n models makes the time n times longer), and how can one DLA process multiple models at the same time?

Does the DLA share memory bandwidth with the GPU, or are any other resources shared? As it stands, Orin cannot fully use the GPU's 167 INT8 TOPS and the DLAs' 87 INT8 TOPS. How do we run all the cores in parallel without their latencies affecting each other?

Dear @haihua.wei,
Did you use the trtexec tool for this experiment? Just want to double-check that you have made sure GPU fallback is not happening.
Is it possible to share repro steps/models/code?

It’s trtexec; the commands we ran are below.
trtexec --loadEngine=./res0220_1_int8.trt --iterations=100 --exportTimes=./bev_time.json --exportProfile=./bev_profile.json --separateProfileRun &

trtexec --loadEngine=./joint_model_cygnus_simple_batch1_512960_230220_dla_opt_sub_dla.int8 --iterations=1000 --useDLACore=0 --exportTimes=./cygnus_dla1_time.json --separateProfileRun &

trtexec --loadEngine=./subgraph_pointpillars_input_int8.trt --iterations=1000 --useDLACore=1 --exportTimes=./cygnus_dla1_time.json --separateProfileRun &
@SivaRamaKrishnaNV

We can confirm that the models run entirely on DLA: the --allowGPUFallback option was not enabled when building them, and the runtime traces captured with nsys profile show that everything runs on DLA. @SivaRamaKrishnaNV

Dear @haihua.wei,
A single DLA runs model 1 and model 2 at the same time: DLA0 “latencyMs”: 33 ms

In this case, are you running both models on DLA0 in parallel instances of trtexec?

Yes, and we also have a requirement for a single DLA0 to run multiple models.
@SivaRamaKrishnaNV

@SivaRamaKrishnaNV
Can you suggest any tools for analyzing memory bandwidth usage while the models are running?

Dear @haihua.wei,
The iGPU and the DLA share the same scheduler and memory resources. Although the tasks execute on separate hardware (i.e., the iGPU and the DLA), the DLA needs a separate GPU context to register its task-finish signal. The context switches between the two GPU contexts add latency to the pipeline, but the overall execution time can still be lower when running in parallel.
Using different trtexec processes creates multiple GPU contexts. You can try launching the models from a single process so that multiple GPU contexts are avoided. Even with a single context, some delay is expected.
Currently, the DLA cannot run models in parallel. We are working on making the GPU+DLA scenario more efficient. Using the cuDLA library directly is also an option to schedule work on the DLA optimally, but cuDLA is not part of the DevZone release; you need to contact your NVIDIA representative if you need access.
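
For illustration only, a minimal single-process sketch using TensorRT's C++ runtime API (not taken from this thread): it deserializes the three engines from the trtexec commands above in one process, pins the two DLA engines to cores 0 and 1 with IRuntime::setDLACore, and enqueues them on separate CUDA streams so only one GPU context is created. The Logger class and the readBlob/loadEngine/allocBindings helpers are illustrative names; buffer sizing assumes static shapes and at most 4 bytes per element, and error handling is omitted.

#include <NvInfer.h>
#include <cuda_runtime_api.h>
#include <fstream>
#include <iostream>
#include <vector>

using namespace nvinfer1;

// Minimal logger required by the TensorRT runtime.
class Logger : public ILogger {
    void log(Severity severity, const char* msg) noexcept override {
        if (severity <= Severity::kWARNING) std::cout << msg << std::endl;
    }
} gLogger;

// Read a serialized engine file into memory.
static std::vector<char> readBlob(const char* path) {
    std::ifstream f(path, std::ios::binary | std::ios::ate);
    std::vector<char> blob(static_cast<size_t>(f.tellg()));
    f.seekg(0);
    f.read(blob.data(), blob.size());
    return blob;
}

// Deserialize one engine; dlaCore < 0 means "run on the GPU".
static ICudaEngine* loadEngine(IRuntime* runtime, const char* path, int dlaCore) {
    if (dlaCore >= 0) runtime->setDLACore(dlaCore);
    auto blob = readBlob(path);
    return runtime->deserializeCudaEngine(blob.data(), blob.size());
}

// Allocate one device buffer per binding (assumes static shapes, <= 4 B/element).
static std::vector<void*> allocBindings(ICudaEngine* engine) {
    std::vector<void*> buf(engine->getNbBindings(), nullptr);
    for (int b = 0; b < engine->getNbBindings(); ++b) {
        Dims d = engine->getBindingDimensions(b);
        size_t vol = 1;
        for (int i = 0; i < d.nbDims; ++i) vol *= static_cast<size_t>(d.d[i]);
        cudaMalloc(&buf[b], vol * 4);
    }
    return buf;
}

int main() {
    IRuntime* runtime = createInferRuntime(gLogger);

    // One GPU engine and two DLA engines, all owned by this single process.
    ICudaEngine* engines[3] = {
        loadEngine(runtime, "./res0220_1_int8.trt", -1),
        loadEngine(runtime,
            "./joint_model_cygnus_simple_batch1_512960_230220_dla_opt_sub_dla.int8", 0),
        loadEngine(runtime, "./subgraph_pointpillars_input_int8.trt", 1)};

    IExecutionContext* ctx[3];
    std::vector<void*> bindings[3];
    cudaStream_t stream[3];
    for (int i = 0; i < 3; ++i) {
        ctx[i] = engines[i]->createExecutionContext();
        bindings[i] = allocBindings(engines[i]);
        cudaStreamCreate(&stream[i]);
    }

    // Enqueue all three networks; they run concurrently on GPU0, DLA0 and DLA1
    // under a single GPU context.
    for (int i = 0; i < 3; ++i) ctx[i]->enqueueV2(bindings[i].data(), stream[i], nullptr);
    for (int i = 0; i < 3; ++i) cudaStreamSynchronize(stream[i]);
    return 0;
}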

If we use cuDLA and NvSciBuf, can we avoid this iGPU/DLA context-switching problem?
@SivaRamaKrishnaNV

We don’t have a tool for this.

Yes, using cuDLA you can avoid some of the overheads.
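
As a rough hybrid-mode cuDLA sketch, assuming the cuDLA headers and library are available for your release (as noted above, access may have to go through your NVIDIA representative): the loadable file name and buffer sizes below are placeholders, and the cudlaTask fields should be checked against the cudla.h shipped with your release. In hybrid mode the DLA task is submitted on a CUDA stream inside the application's own context, which is how some of the context-related overhead can be avoided.

#include <cudla.h>
#include <cuda_runtime_api.h>
#include <fstream>
#include <vector>

int main() {
    // Attach to DLA core 0 in hybrid (CUDA-integrated) mode, so DLA work is
    // ordered on a CUDA stream inside this process's context.
    cudlaDevHandle dev;
    cudlaCreateDevice(0, &dev, CUDLA_CUDA_GPU);

    // Load a precompiled DLA loadable ("dla_loadable.bin" is a placeholder name).
    std::ifstream f("dla_loadable.bin", std::ios::binary | std::ios::ate);
    std::vector<uint8_t> blob(static_cast<size_t>(f.tellg()));
    f.seekg(0);
    f.read(reinterpret_cast<char*>(blob.data()), blob.size());
    cudlaModule module;
    cudlaModuleLoadFromMemory(dev, blob.data(), blob.size(), &module, 0);

    // Allocate CUDA buffers and register them with the DLA. The sizes are
    // placeholders; real code should query them via cudlaModuleGetAttributes.
    const size_t inputBytes = 1 << 20, outputBytes = 1 << 20;
    void *inGpu = nullptr, *outGpu = nullptr;
    cudaMalloc(&inGpu, inputBytes);
    cudaMalloc(&outGpu, outputBytes);
    uint64_t *inDla = nullptr, *outDla = nullptr;
    cudlaMemRegister(dev, reinterpret_cast<uint64_t*>(inGpu), inputBytes, &inDla, 0);
    cudlaMemRegister(dev, reinterpret_cast<uint64_t*>(outGpu), outputBytes, &outDla, 0);

    // Submit the DLA task on a CUDA stream and wait for it like any other
    // stream operation.
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudlaTask task = {};
    task.moduleHandle = module;
    task.inputTensor = &inDla;
    task.numInputTensors = 1;
    task.outputTensor = &outDla;
    task.numOutputTensors = 1;
    cudlaSubmitTask(dev, &task, 1, stream, 0);
    cudaStreamSynchronize(stream);

    cudlaModuleUnload(module, 0);
    cudlaDestroyDevice(dev);
    return 0;
}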

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.