DLA and GPU running at the same time - performance question

eyalhir74 · August 6, 2020, 5:35am

I’ve run into some issues with my project when running the DLA and GPU at the same time, so I went back to your sample.
As far as I understand, I see the same behavior in your test code.
I ran the test under nvprof in the following manner: ./test -1 → Result is shown in gpu_only.jpg
Ran ./test -1 0 → Result is shown in gpu_dla.jpg

I’ve instrumented the code with an empty cuda kernel and called it before and after the enqueue call.
That way I know when the GPU/DLA functionality starts and ends.

As can be seen in the attached images, when I run the GPU alone, it takes ~230-250ms. When the GPU and DLA run together, it takes ~500ms.

The gpu_dla.jpg shows that the DLA and GPU run concurrently (at least to some degree), but it seems they block/interfere with each other. This is what I also see in my test code.

I understand that this might be due to a lack of resources, the network configuration, the network layers, etc. However, the end result is that moving even a simple test to the DLA did not improve performance over running everything on the GPU alone.

Any insights would be greatly appreciated.

Reference post is in:

gpu_only

thanks
Eyal

AastaLLL · August 6, 2020, 8:42am

Hi,

Would you mind sharing your customized code with the empty kernel?

We are not sure how you handle the CUDA streams for the inference and the kernel on GPU/DLA.
This can change the GPU scheduling behavior.

Thanks.

eyalhir74 · August 6, 2020, 8:47am

Hi,
Attached is the code you’ve originally sent, changes are in the exec_once method of the Task class.

main.txt (5.7 KB)

eyalhir74 · August 12, 2020, 4:57am

Hi @AastaLLL,
Do you need further information for this?

thanks
Eyal

eyalhir74 · August 21, 2020, 2:49pm

Hi @AastaLLL, any assistance here would be greatly appreciated

thanks
Eyal

eyalhir74 · August 23, 2020, 5:11am

Hi @AastaLLL and @kayccc,
I’d appreciate any response about this issue. Currently we cannot use the DLAs, resulting in lower performance.

thanks
Eyal

AastaLLL · August 26, 2020, 7:24am

Hi,

Sorry for the late update.
Would you mind setting the following environment variable first to see if it helps?

CUDA_DEVICE_MAX_CONNECTIONS=4 ./test [argument]
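
If the sample is launched from a script rather than a shell, the same setting can be passed through the child process environment. This is only a minimal Python sketch: it assumes the binary is `./test` as in this thread, and the helper names `env_with_max_connections` and `run_test` are hypothetical.

```python
import os
import subprocess


def env_with_max_connections(n):
    """Return a copy of the current environment with the variable set."""
    env = dict(os.environ)  # copy, so the parent environment is untouched
    env["CUDA_DEVICE_MAX_CONNECTIONS"] = str(n)
    return env


def run_test(args, n=4, binary="./test"):
    """Launch the sample with the variable set (binary path is an assumption)."""
    return subprocess.run(
        [binary, *args],
        env=env_with_max_connections(n),
        check=True,
    )
```

Calling `run_test(["-1", "0"])` would then correspond to `CUDA_DEVICE_MAX_CONNECTIONS=4 ./test -1 0`.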

Thanks.

eyalhir74 · August 26, 2020, 10:05am

Hi @AastaLLL

There’s no change.
I see it both in nvprof and in the FPS printed by the application itself.

nvidia@nvidia-desktop:~/MyCode/NvidiaConcurrentSample$ ./test -1
Load engine from :gpu.engine
FPS: 6393.62
FPS: 6086.69
FPS: 6442.61
FPS: 6008.84
FPS: 6537.36

nvidia@nvidia-desktop:~/MyCode/NvidiaConcurrentSample$ ./test 0
Load engine from :dla.engine
FPS: 1319.87
FPS: 1360.68
FPS: 1381.76

nvidia@nvidia-desktop:~/MyCode/NvidiaConcurrentSample$ ./test 0
Load engine from :dla.engine
FPS: 1366.91
FPS: 1394.72

nvidia@nvidia-desktop:~/MyCode/NvidiaConcurrentSample$ ./test -1 0
Load engine from :gpu.engine
Load engine from :dla.engine
FPS: 2769.46 725.914
FPS: 2828.35 744.862
FPS: 2710.54 734.875
FPS: 2552.58 717.881

nvidia@nvidia-desktop:~/MyCode/NvidiaConcurrentSample$ CUDA_DEVICE_MAX_CONNECTIONS=4 ./test -1 0
Load engine from :gpu.engine
Load engine from :dla.engine
FPS: 2528.99 690.905
FPS: 2836.46 742.717
FPS: 2807.82 749.844
FPS: 2479.49 720.853
FPS: 2536.4 725.848
FPS: 2459.43 710.819
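
To put the numbers above side by side, here is a minimal Python sketch that averages the FPS lines copied from the logs in this post. It assumes that in the `./test -1 0` output the first value on each line is the GPU engine and the second is the DLA engine (the order of the arguments), and it uses the concurrent run without `CUDA_DEVICE_MAX_CONNECTIONS`.

```python
from statistics import mean

# FPS values copied from the logs above.
gpu_only = [6393.62, 6086.69, 6442.61, 6008.84, 6537.36]
dla_only = [1319.87, 1360.68, 1381.76, 1366.91, 1394.72]
# (GPU, DLA) pairs from "./test -1 0"; the column order is an assumption.
concurrent = [
    (2769.46, 725.914),
    (2828.35, 744.862),
    (2710.54, 734.875),
    (2552.58, 717.881),
]

gpu_alone = mean(gpu_only)
dla_alone = mean(dla_only)
gpu_shared = mean(g for g, _ in concurrent)
dla_shared = mean(d for _, d in concurrent)
combined = gpu_shared + dla_shared

print(f"GPU alone:        {gpu_alone:.1f} FPS")
print(f"DLA alone:        {dla_alone:.1f} FPS")
print(f"GPU + DLA total:  {combined:.1f} FPS")
print(f"total / GPU alone: {combined / gpu_alone:.2f}")
```

With these figures the combined GPU + DLA throughput comes out at a little over half of what the GPU reaches on its own, which matches the slowdown seen in nvprof.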

AastaLLL · August 27, 2020, 7:05am

Thanks.

We are checking this issue internally.
We will share more information with you later.

eyalhir74 · September 3, 2020, 4:35am

Hi @AastaLLL any update on this issue?

thanks
Eyal

AastaLLL · September 4, 2020, 5:12am

Hi,

Sorry for keeping you waiting.
We are actively checking this issue and will update you once we have more complete information.

Thanks.

eyalhir74 · September 15, 2020, 5:54am

Hi @AastaLLL, sorry for nudging… any update you can share please?

thanks
Eyal

eyalhir74 · September 27, 2020, 7:32am

Hi @AastaLLL @kayccc,
Guys any update on this? We’re really stuck on this issue, preventing us from running our algo on the Xavier.

Thanks
Eyal

AastaLLL · September 29, 2020, 5:55am

Hi,

Sorry, we are still checking this issue.
We will keep you updated once we make any progress.

Thanks.

eyalhir74 · October 11, 2020, 4:12am

Hi Guys,
Any progress whatsoever?
@AastaLLL @kayccc

thanks
Eyal

AastaLLL · October 14, 2020, 5:29am

Hi,

Sorry for keeping you waiting.

The cause of this issue is some false dependencies on a pthread mutex, which force TensorRT to wait (on either the GPU or the DLA).
It can be reproduced with DLA+GPU or GPU+GPU. Based on this, the issue seems to be in the application.
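
As a rough illustration of what a false dependency on a mutex does to two otherwise independent workers, here is a minimal Python sketch. It is only an analogy and does not reproduce the TensorRT or CUDA internals: the two threads stand in for the two engines, and `LAUNCHES` and `LAUNCH_COST_S` are hypothetical values chosen for the demonstration.

```python
import threading
import time

LAUNCHES = 50          # hypothetical number of launches per worker
LAUNCH_COST_S = 0.002  # hypothetical time the lock is held per launch


def worker(lock):
    # Each "launch" holds the lock for a short time.
    for _ in range(LAUNCHES):
        with lock:
            time.sleep(LAUNCH_COST_S)


def run_pair(lock_a, lock_b):
    """Run two workers in parallel and return the wall-clock time."""
    threads = [
        threading.Thread(target=worker, args=(lock_a,)),
        threading.Thread(target=worker, args=(lock_b,)),
    ]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


# Both workers contend for one lock: their launches are serialized.
shared = threading.Lock()
shared_time = run_pair(shared, shared)

# Each worker has its own lock: the launches can overlap.
independent_time = run_pair(threading.Lock(), threading.Lock())

print(f"shared lock:       {shared_time:.3f} s")
print(f"independent locks: {independent_time:.3f} s")
```

With a shared lock the two workers take about as long as running them back to back, even though neither needs the other's result.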

Have you tried the trtexec binary located at /usr/src/tensorrt/bin?
It also supports multi-stream inference but doesn’t have this performance regression.
It’s worth checking whether updating trtexec to run multiple engines fixes your issue.

Thanks.

eyalhir74 · October 14, 2020, 5:35am

Hi @AastaLLL
Thanks a lot for the answer. The application is actually something you yourself sent in the past - and the issue occurs in my code as well.
So just to make sure I understand - is this an internal pthread issue? A Linux issue?
I don’t quite understand what the suggested solution is. Can you please elaborate?
I’ll have a look at the trtexec code - I should have its C++ source, right?

thanks a lot,
Eyal

AastaLLL · October 15, 2020, 3:19am

Hi,

We do know the application was sent by us. We were just trying to share some status with you.
Sorry if this caused confusion.

In trtexec, the API for threading (CPU->GPU) is a little bit different (lower-level).
Given the time constraints, it can be an alternative solution to try.

For the app we shared, the issue comes from some unknown latency when the CPU launches GPU tasks.
Launches happen frequently, since each operation within a model is a separate GPU task (if not merged).

We observe that launching incurs some unknown latency (waiting for a mutex), which may be related to priority.
But we still need more time to figure out the details.

Thanks.

eyalhir74 · October 19, 2020, 5:23am

Hi @AastaLLL,
Thanks a lot for the information. Is the trtexec code open so I can have a look at how threading is done there? Where is it?

Thanks a lot for all the effort to assist. If you have further information on how to use the default threading mechanism, I would appreciate it if you could share that as well.

Eyal

eyalhir74 · November 2, 2020, 1:45pm

Hi @AastaLLL,
Sorry for being a nudge and bringing this up again… I don’t see any threading reference in the trtexec code.
The execute method runs on a CUDA stream but I don’t see any threads going on there.

Could you please point me to what you’ve suggested there?

thanks
Eyal
