DLA and GPU cores at the same time

Hi @AastaLLL,
I’ve run into some issues with my project when running the DLA and GPU at the same time, so I went back to your sample.
As far as I understand, I see the same behavior in your test code.
I ran the test under nvprof in the following manner: ./test -1 → Result is shown in gpu_only.jpg
Ran ./test -1 0 → Result is shown in gpu_dla.jpg

I’ve instrumented the code with an empty CUDA kernel and call it before and after the enqueue call.
That way I know when the GPU/DLA functionality starts and ends.

As can be seen in the attached images, when I run the GPU alone, it takes ~230-250 ms. When the GPU and DLA run together, it takes ~500 ms.

gpu_dla.jpg shows that the DLA and GPU run concurrently (at least to some degree), but they seem to block or interfere with each other. This is what I also see in my own test code.
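
To put a number on "to some degree", here is a rough sketch of how the overlap between the two marker-delimited spans could be computed from timestamps read off the nvprof timeline. The `Interval` struct and both function names are made up for illustration; they are not part of the sample or of any NVIDIA API, and any values passed in would have to come from your own trace.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper type: one span on the profiler timeline, in milliseconds,
// e.g. the time between the "before" and "after" marker kernels.
struct Interval {
    double start_ms;
    double end_ms;
};

// Time during which both spans are active; 0 if they do not overlap at all.
double overlap_ms(const Interval& a, const Interval& b) {
    assert(a.start_ms <= a.end_ms && b.start_ms <= b.end_ms);
    const double start = std::max(a.start_ms, b.start_ms);
    const double end = std::min(a.end_ms, b.end_ms);
    return std::max(0.0, end - start);
}

// Overlap as a fraction of the shorter span: 1.0 means the shorter span ran
// entirely while the other one was active, 0.0 means no concurrency.
double overlap_fraction(const Interval& a, const Interval& b) {
    const double shorter =
        std::min(a.end_ms - a.start_ms, b.end_ms - b.start_ms);
    return shorter > 0.0 ? overlap_ms(a, b) / shorter : 0.0;
}
```

A fraction close to 1.0 combined with a total time that roughly doubles would suggest the two engines are scheduled concurrently but are slowing each other down, rather than simply running back to back.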

I understand that this might be due to a lack of resources, the network configuration, the network layers, etc. However, the end result is that moving even a simple test to the DLA did not improve performance over running everything on the GPU alone.

Any insights would be greatly appreciated.

Thanks,
Eyal