DLA and GPU running at the same time - performance question

Hi @AastaLLL

I’ve run into some issues with my project when running DLA and GPU at the same time, so I went back to your sample.
As far as I understand, I see the same behavior in your test code.
I ran the test under nvprof as follows:
./test -1 --> result is shown in gpu_only.jpg
./test -1 0 --> result is shown in gpu_dla.jpg

I’ve instrumented the code with an empty CUDA kernel and called it before and after the enqueue call.
That way I know when the GPU/DLA functionality starts and ends.

As can be seen in the attached images, when I run the GPU alone, it takes ~230-250 ms. When the GPU and DLA run together, it takes ~500 ms.

The gpu_dla.jpg image shows that the DLA and GPU run concurrently (at least to some degree), but they seem to block or interfere with each other. This is what I also see in my test code.

I understand that this might be due to a lack of resources, the network configuration, the network layers, etc. However, the end result is that moving even a simple test to the DLA did not improve performance over running everything on the GPU alone.

Any insights would be greatly appreciated.

Reference post is in:



thanks
Eyal

Hi,

Would you mind sharing your customized code with the empty kernel?

We are not sure how you handle the CUDA streams for the inference and the kernel on GPU/DLA.
This may cause different behavior in the GPU scheduling.

Thanks.

Hi,
Attached is the code you originally sent; the changes are in the exec_once method of the Task class.

main.txt (5.7 KB)

Hi @AastaLLL,
Do you need further information for this?

thanks
Eyal

Hi @AastaLLL, any assistance here would be greatly appreciated

thanks
Eyal

Hi @AastaLLL and @kayccc,
I’d appreciate any response about this issue. Currently we cannot use the DLAs, which results in lower performance.

thanks
Eyal

Hi,

Sorry for the late update.
Would you mind setting the following environment variable first to see if it helps?

CUDA_DEVICE_MAX_CONNECTIONS=4 ./test [argument]

Thanks.

Hi @AastaLLL

There’s no change.
I see it both in nvprof and in the FPS printed by the application itself.

nvidia@nvidia-desktop:~/MyCode/NvidiaConcurrentSample$ ./test -1
Load engine from :gpu.engine
FPS: 6393.62
FPS: 6086.69
FPS: 6442.61
FPS: 6008.84
FPS: 6537.36

nvidia@nvidia-desktop:~/MyCode/NvidiaConcurrentSample$ ./test 0
Load engine from :dla.engine
FPS: 1319.87
FPS: 1360.68
FPS: 1381.76

nvidia@nvidia-desktop:~/MyCode/NvidiaConcurrentSample$ ./test 0
Load engine from :dla.engine
FPS: 1366.91
FPS: 1394.72

nvidia@nvidia-desktop:~/MyCode/NvidiaConcurrentSample$ ./test -1 0
Load engine from :gpu.engine
Load engine from :dla.engine
FPS: 2769.46 725.914
FPS: 2828.35 744.862
FPS: 2710.54 734.875
FPS: 2552.58 717.881

nvidia@nvidia-desktop:~/MyCode/NvidiaConcurrentSample$ CUDA_DEVICE_MAX_CONNECTIONS=4 ./test -1 0
Load engine from :gpu.engine
Load engine from :dla.engine
FPS: 2528.99 690.905
FPS: 2836.46 742.717
FPS: 2807.82 749.844
FPS: 2479.49 720.853
FPS: 2536.4 725.848
FPS: 2459.43 710.819
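
For reference, the numbers above can be summarized with a bit of arithmetic. This is only a minimal C++ sketch using the FPS lines pasted above: the two `./test 0` runs are pooled, the `CUDA_DEVICE_MAX_CONNECTIONS=4` run is left out, and it assumes that the first column of the two-engine output is the GPU engine and the second is the DLA engine (matching the `-1 0` argument order). The function names are made up for this illustration.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Arithmetic mean of a list of FPS samples.
double mean_fps(const std::vector<double>& samples) {
    assert(!samples.empty());
    return std::accumulate(samples.begin(), samples.end(), 0.0) / samples.size();
}

// FPS lines copied from the logs above.
const std::vector<double> gpu_alone = {6393.62, 6086.69, 6442.61, 6008.84, 6537.36};
const std::vector<double> dla_alone = {1319.87, 1360.68, 1381.76, 1366.91, 1394.72};
// Assumption: in "./test -1 0" the first column is the GPU engine,
// the second column is the DLA engine.
const std::vector<double> gpu_shared = {2769.46, 2828.35, 2710.54, 2552.58};
const std::vector<double> dla_shared = {725.914, 744.862, 734.875, 717.881};

// Combined throughput with both engines running, relative to the GPU alone.
double combined_vs_gpu_alone() {
    return (mean_fps(gpu_shared) + mean_fps(dla_shared)) / mean_fps(gpu_alone);
}

// Combined throughput relative to the ideal case where the standalone
// GPU and DLA results simply add up.
double combined_vs_ideal() {
    return (mean_fps(gpu_shared) + mean_fps(dla_shared)) /
           (mean_fps(gpu_alone) + mean_fps(dla_alone));
}
```

On these samples `combined_vs_gpu_alone()` comes out at roughly 0.55 and `combined_vs_ideal()` at roughly 0.45, so GPU+DLA together deliver only a bit more than half of what the GPU achieves on its own.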

Thanks.

We are checking this issue internally.
We will share more information with you later.


Hi @AastaLLL any update on this issue?

thanks
Eyal

Hi,

Sorry for keeping you waiting.
We are actively checking this issue and will update you once we have more complete information.

Thanks.


Hi @AastaLLL, sorry for nudging… any update you can share please?

thanks
Eyal

Hi @AastaLLL @kayccc,
Guys, any update on this? We’re really stuck on this issue, which prevents us from running our algo on the Xavier.

Thanks
Eyal

Hi,

Sorry, we are still checking this issue.
We will keep you updated once we make any progress.

Thanks.

Hi Guys,
Any progress whatsoever?
@AastaLLL @kayccc

thanks
Eyal

Hi,

Sorry for keeping you waiting.

The issue is caused by some false dependencies on a pthread mutex, which force TensorRT to wait (on either the GPU or the DLA).
It can be reproduced with DLA+GPU or GPU+GPU. Based on this, the issue seems to be in the application.
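
To illustrate what a false dependency on a mutex means in general, here is a minimal C++ sketch. It is not the TensorRT code or the sample application: `max_overlap`, `gpu_like`, and `dla_like` are hypothetical names, and the short sleep merely stands in for the time spent submitting one task. Two worker threads that are logically independent still end up serialized if they take the same lock.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <functional>
#include <mutex>
#include <thread>

// Runs two workers that each "submit" a number of small tasks.
// Returns the highest number of workers observed inside the submit
// section at the same time (1 = fully serialized, 2 = overlapping).
int max_overlap(bool share_lock, int launches_per_worker) {
    assert(launches_per_worker > 0);
    std::mutex lock_a, lock_b;
    std::atomic<int> inside{0};
    std::atomic<int> peak{0};

    auto worker = [&](std::mutex& m) {
        for (int i = 0; i < launches_per_worker; ++i) {
            std::lock_guard<std::mutex> guard(m);
            int now = ++inside;
            // Record the largest overlap seen so far.
            int seen = peak.load();
            while (now > seen && !peak.compare_exchange_weak(seen, now)) {}
            // Placeholder for the work of submitting one task.
            std::this_thread::sleep_for(std::chrono::microseconds(50));
            --inside;
        }
    };

    // Hypothetical stand-ins for the GPU and DLA submission threads.
    std::thread gpu_like(worker, std::ref(lock_a));
    std::thread dla_like(worker, std::ref(share_lock ? lock_a : lock_b));
    gpu_like.join();
    dla_like.join();
    return peak.load();
}
```

With `share_lock = true` the function always returns 1, because the two workers can never be in the submit section together. With separate locks they are free to overlap, so it will usually return 2.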

Have you tried the trtexec binary located at /usr/src/tensorrt/bin?
It also supports multi-stream inference but doesn’t show this performance regression.
It’s worth trying to see whether updating trtexec for multiple engines fixes your issue.

Thanks.

Hi @AastaLLL
Thanks a lot for the answer. The application is actually something you yourself sent in the past - and the issue occurs in my code as well.
So just to make sure I understand - this has something to do with an internal pthread issue? A Linux issue?
I don’t quite understand what the suggested solution is; can you please elaborate?
I’ll have a look at the trtexec code - I should have its C++ source, right?

thanks a lot,
Eyal

Hi,

We do know the application was sent by us. We were just trying to share some status with you.
Sorry if this caused any confusion.

In trtexec, the API for threading (CPU->GPU) is a little bit different (lower-level).
If time is a concern, it can be an alternative to try.

For the app we shared, the issue comes from some unknown latency when the CPU launches GPU tasks.
Launches happen frequently, since each operation within a model is a separate GPU task (if not merged).

We observe that launching incurs some unknown latency (waiting for a mutex), which may be related to priority.
But we still need more time to figure out the details.

Thanks.

Hi @AastaLLL,
Thanks a lot for the information. Is the trtexec code open so I can have a look at how threading is done there? Where is it?

Thanks a lot for all the effort to assist. If you have further information on how to use the default threading mechanism, I would appreciate it if you could share that as well.

Eyal

Hi @AastaLLL,
Sorry for nudging and bringing this up again… I don’t see any reference to threading in the trtexec code.
The execute method runs on a CUDA stream, but I don’t see any threads being used there.

Could you please point me to what you’ve suggested there?

thanks
Eyal