DLA and GPU running at the same time - performance question

Hi @AastaLLL

I’ve run into some issues with my project when running the DLA and GPU at the same time, so I went back to your sample.
As far as I understand, I see the same behavior in your test code.
I ran the test under nvprof in the following manner:
./test -1 --> result is shown in gpu_only.jpg
./test -1 0 --> result is shown in gpu_dla.jpg

I’ve instrumented the code with an empty CUDA kernel and called it before and after the enqueue call.
That way I know when the GPU/DLA functionality starts and ends.
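For reference, the instrumentation pattern is roughly the following (a sketch only; the kernel name, `instrumentedEnqueue`, and the buffer/stream parameters are illustrative, not the exact code from main.txt):

```cuda
#include <cuda_runtime.h>
#include <NvInfer.h>

// Empty kernel that serves purely as a marker in the nvprof timeline.
__global__ void markerKernel() {}

// Bracket the TensorRT enqueue with marker launches so the start and
// end of the GPU/DLA work are visible in the profiler.
void instrumentedEnqueue(nvinfer1::IExecutionContext* context,
                         void** buffers, cudaStream_t stream)
{
    markerKernel<<<1, 1, 0, stream>>>();          // marks the start
    context->enqueueV2(buffers, stream, nullptr); // the actual inference
    markerKernel<<<1, 1, 0, stream>>>();          // marks the end
}
```

Because the markers are launched on the same stream as the enqueue, they execute in order around the inference work and delimit it on the timeline.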

As can be seen in the attached images, when I run the GPU alone, it takes ~230-250 ms. When the GPU and DLA run together, it takes ~500 ms.

The gpu_dla.jpg shows that the DLA and GPU run concurrently (at least to some degree), but it seems they block/interfere with each other. This is what I also see in my test code.

I understand that this might be due to a lack of resources, the network configuration, the network layers, etc… However, the end result is that moving even a simple test to the DLA did not improve performance over running everything on the GPU alone.

Any insights would be greatly appreciated.

Reference post is in:

[Attachment: gpu_only]


thanks
Eyal

Hi,

Would you mind sharing your customized code for the empty kernel with us?

We are not sure how you handle the CUDA streams for the inference/kernels on the GPU/DLA.
This could cause the different behavior in GPU scheduling.
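One common pattern is to give each execution context its own CUDA stream so the GPU and DLA enqueues can overlap rather than serialize. A minimal sketch, assuming two already-built contexts and device bindings (all names here are illustrative):

```cuda
#include <cuda_runtime.h>
#include <NvInfer.h>

// Issue the GPU and DLA engines on separate streams so their work
// can be scheduled concurrently, then wait for both to finish.
void runConcurrently(nvinfer1::IExecutionContext* gpuCtx, void** gpuBuf,
                     nvinfer1::IExecutionContext* dlaCtx, void** dlaBuf)
{
    cudaStream_t gpuStream, dlaStream;
    cudaStreamCreate(&gpuStream);
    cudaStreamCreate(&dlaStream);

    gpuCtx->enqueueV2(gpuBuf, gpuStream, nullptr); // GPU engine
    dlaCtx->enqueueV2(dlaBuf, dlaStream, nullptr); // DLA engine

    cudaStreamSynchronize(gpuStream);
    cudaStreamSynchronize(dlaStream);

    cudaStreamDestroy(gpuStream);
    cudaStreamDestroy(dlaStream);
}
```

If both contexts share one stream (or the default stream), the enqueues serialize and no overlap is possible, which would change the scheduling behavior seen in the profiler.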

Thanks.

Hi,
Attached is the code you originally sent; the changes are in the exec_once method of the Task class.

main.txt (5.7 KB)

Hi @AastaLLL,
Do you need further information for this?

thanks
Eyal

Hi @AastaLLL, any assistance here would be greatly appreciated

thanks
Eyal

Hi @AastaLLL and @kayccc,
I’d appreciate any response on this issue. Currently we cannot use the DLAs, resulting in lower performance.

thanks
Eyal

Hi,

Sorry for the late update.
Would you mind setting the following environment variable first to see if it helps?

CUDA_DEVICE_MAX_CONNECTIONS=4 ./test [argument]

Thanks.

Hi @AastaLLL

There’s no change.
I see it both in nvprof and in the FPS printed by the application itself.

nvidia@nvidia-desktop:~/MyCode/NvidiaConcurrentSample$ ./test -1
Load engine from :gpu.engine
FPS: 6393.62
FPS: 6086.69
FPS: 6442.61
FPS: 6008.84
FPS: 6537.36

nvidia@nvidia-desktop:~/MyCode/NvidiaConcurrentSample$ ./test 0
Load engine from :dla.engine
FPS: 1319.87
FPS: 1360.68
FPS: 1381.76

nvidia@nvidia-desktop:~/MyCode/NvidiaConcurrentSample$ ./test 0
Load engine from :dla.engine
FPS: 1366.91
FPS: 1394.72

nvidia@nvidia-desktop:~/MyCode/NvidiaConcurrentSample$ ./test -1 0
Load engine from :gpu.engine
Load engine from :dla.engine
FPS: 2769.46 725.914
FPS: 2828.35 744.862
FPS: 2710.54 734.875
FPS: 2552.58 717.881

nvidia@nvidia-desktop:~/MyCode/NvidiaConcurrentSample$ CUDA_DEVICE_MAX_CONNECTIONS=4 ./test -1 0
Load engine from :gpu.engine
Load engine from :dla.engine
FPS: 2528.99 690.905
FPS: 2836.46 742.717
FPS: 2807.82 749.844
FPS: 2479.49 720.853
FPS: 2536.4 725.848
FPS: 2459.43 710.819

Thanks.

We are checking this issue internally.
We will update you with more information later.


Hi @AastaLLL any update on this issue?

thanks
Eyal

Hi,

Sorry for keeping you waiting.
We are actively checking this issue and will update you once we have more complete information.

Thanks.


Hi @AastaLLL, sorry for nudging… any update you can share please?

thanks
Eyal