Run GPU and DLAs concurrently


I’m using JetPack 4.3 and TensorRT 6.0.

I am trying to run three networks on the Xavier AGX: the largest runs on the GPU and the other two on DLA0 and DLA1. Inference runs in three threads, one per hardware unit.

However, the GPU and the DLAs appear to run serially rather than concurrently. (see attached profiler screenshot)

I used trtexec to generate the engines, and the DLA engines were built without GPUFallback. All layers that are supposed to run on the DLAs are supported.
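For reference, the DLA engines can be built with trtexec roughly as follows (model filename is a placeholder; DLA requires FP16 or INT8 precision, and omitting --allowGPUFallback makes the build fail instead of silently falling back to the GPU):

```shell
# Build an engine pinned to DLA core 0, FP16, no GPU fallback (placeholder model name)
trtexec --onnx=model_dla0.onnx --useDLACore=0 --fp16 --saveEngine=model_dla0.engine

# Same for DLA core 1
trtexec --onnx=model_dla1.onnx --useDLACore=1 --fp16 --saveEngine=model_dla1.engine
```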

I used nvvp for profiling.

Thanks in advance.


Our profiler doesn’t support DLA profiling yet.
So you can only see the time slots where the DLA thread uses the GPU for reformatting (data transfer).
The real inference part is missing from the timeline.



I know that the profiler doesn’t profile the DLAs.
But while the DLAs are executing, the GPU is idle, and GPUFallback is disabled.

If I synchronize the DLAs and the GPU, the total inference duration is the same.


I solved my problem.

For me, the solution was to start the DLA inference earlier, and I also created an additional thread for the GPU inference.
I now start the DLA inference right after the GPU inference starts.
With this change, the overall execution time is reduced.
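The pattern above can be sketched with plain threads. This is a minimal illustration, not TensorRT code: run_inference stands in for enqueueing on an execution context and synchronizing its stream, and the sleep durations are placeholder latencies. The key point is that all three threads are started back-to-back, so no device waits for another to finish:

```python
import threading
import time

def run_inference(name, duration, log, lock):
    """Placeholder for context.execute_async(...) + stream synchronize.
    'duration' stands in for the real inference latency on that device."""
    start = time.perf_counter()
    time.sleep(duration)  # simulated inference work
    with lock:
        log.append((name, start, time.perf_counter()))

def run_all():
    log, lock = [], threading.Lock()
    # One thread per hardware unit: GPU, DLA0, DLA1 (durations are made up).
    threads = [
        threading.Thread(target=run_inference, args=("gpu", 0.20, log, lock)),
        threading.Thread(target=run_inference, args=("dla0", 0.10, log, lock)),
        threading.Thread(target=run_inference, args=("dla1", 0.10, log, lock)),
    ]
    # Start the GPU thread, then kick off the DLA threads immediately --
    # do not wait for the GPU inference to complete first.
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log

if __name__ == "__main__":
    t0 = time.perf_counter()
    results = run_all()
    # Wall time should be close to the slowest device (~0.2 s),
    # not the sum of all three (~0.4 s), since the work overlaps.
    print(f"finished {len(results)} inferences in {time.perf_counter() - t0:.2f} s")
```

With real TensorRT engines, each thread would own its own execution context and CUDA stream, and synchronization (if any) would happen only after all three have been enqueued.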

Maybe this could help someone in the future.