I’ve run into some issues with my project when running the DLA and GPU at the same time, so I went back to your sample.
As far as I understand, I see the same behavior in your test code.
I ran the test under nvprof in the following manner: `./test -1` (GPU only) --> result is shown in gpu_only.jpg.
Then `./test -1 0` (GPU + DLA) --> result is shown in gpu_dla.jpg.
I’ve instrumented the code with an empty CUDA kernel launched immediately before and after the enqueue call, so I can see exactly when the GPU/DLA work starts and ends on the nvprof timeline.
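For reference, this is roughly the instrumentation I used — a minimal sketch, assuming an `enqueueV2`-style TensorRT call; the context/binding names are placeholders for whatever the sample actually uses:

```cuda
#include <cuda_runtime.h>

// Empty marker kernel: it does no work, but shows up as a named entry
// on the nvprof/Nsight timeline, bracketing the region of interest.
__global__ void markerKernel() {}

// Hypothetical wrapper: launch a marker, enqueue the inference, launch
// another marker, all on the same stream so ordering is preserved.
void timedEnqueue(cudaStream_t stream /*, nvinfer1::IExecutionContext* ctx,
                  void** bindings */)
{
    markerKernel<<<1, 1, 0, stream>>>();   // start marker
    // ctx->enqueueV2(bindings, stream, nullptr);  // the actual inference
    markerKernel<<<1, 1, 0, stream>>>();   // end marker
}
```

Because the markers are enqueued on the same stream as the inference, the gap between the two marker entries in the timeline bounds the enqueue’s GPU-visible duration.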
As can be seen in the attached images, the GPU alone takes ~230-250 ms; when the GPU and DLA run together, it takes about ~500 ms.
gpu_dla.jpg shows that the DLA and GPU run concurrently (at least to some degree), but they appear to block/interfere with each other. This matches what I see in my own test code.
I understand that this might be due to a lack of resources, network configuration, network layers, etc.; however, the end result is that moving even a simple test to the DLA did not improve performance over running everything on the GPU alone.
Any insights would be greatly appreciated.