DLA and GPU cores at the same time

I’m trying to run 3 networks concurrently: one on DLA0, one on DLA1, and the third on the GPU, using the setDefaultDeviceType and setDLACore functions.
I’m using TensorRT 5.

After building the 3 networks (which seems to be doing the work correctly), I use the following code to launch them together:

for (int loop = 0; loop < 50; loop++)
    for (int i = 0; i < 3; i++)
        m_execution_engines[i]->enqueue(1, buffers[i], streams[i], nullptr);

Judging from their timings, they run serially, not concurrently.

Am I missing anything here?


Actually, there is a known DLA bug: you can’t run two DLA cores in the same process.
You can program it, but the two DLAs will not run at the same time.
This is a known issue and is expected to be fixed in a future release.

Thanks for the answer. I removed one of the DLAs, but the GPU and DLA0 still seem to be running serially.
Any other suggestions?

CPU Total time : [470.815 ms]
CPU timer average : [4.70815 ms]
[DLA 1]: Total time for 0 layers: 0ms
[DLA 1]: Shortest layer: [] ran for [3.40282e+38ms]
[DLA 1]: Longest layer: [] ran for [0ms]
[DLA 1]: Average time : [0ms]
[DLA 0]: Total time for 500 layers: 182.218ms
[DLA 0]: Shortest layer: [output copy finish] ran for [0.00352ms]
[DLA 0]: Longest layer: [output from nvm] ran for [1.99443ms]
[DLA 0]: Average time : [0.363708ms]
[GPU]: Total time for 100 layers: 264.819ms
[GPU]: Shortest layer: [(Unnamed Layer* 0) [Convolution]] ran for [2.5999ms]
[GPU]: Longest layer: [(Unnamed Layer* 0) [Convolution]] ran for [4.1215ms]
[GPU]: Average time : [2.62197ms]


I guess it’s because you run them in the same thread.
Could you try running them in the same process but in separate threads?


Here is a sample to run GPU and DLAs at the same time.

1. Please prepare the TensorRT engines for the GPU and DLA with trtexec first.
For example,

$ /usr/src/tensorrt/bin/trtexec --onnx=/usr/src/tensorrt/data/mnist/mnist.onnx --saveEngine=gpu.engine
$ /usr/src/tensorrt/bin/trtexec --onnx=/usr/src/tensorrt/data/mnist/mnist.onnx --useDLACore=0 --allowGPUFallback --saveEngine=dla.engine

2. Compile
TensorRT_sample.zip (3.5 KB)

$ unzip TensorRT_sample.zip
$ cd TensorRT_sample
$ make

3. Test
Please put the gpu.engine and dla.engine generated in step 1 into the TensorRT_sample folder.
The command runs like this:

$ ./test <thread0> <thread1> ... <threadK>   # -1: GPU, 0: DLA0, 1: DLA1

Ex. Run GPU+DLA0+DLA1.

$ ./test -1 0 1


Wow… thanks a lot I’ll have a look at the code.
A few followups though from a quick run on my Xavier.

  1. Running with 2 DLAs causes a seg fault. Maybe this is because of what juns said? Didn’t check yet.
    ./test -1 -1 -1 --> works fine
    ./test -1 0 --> works fine
    ./test -1 0 0 --> crashes

  2. Running with only the GPU, the code says it gets to ~7700 FPS.
    Running on either of the DLAs, the code says it gets to ~1200 FPS.
    Running both the GPU and one of the DLAs, it gets to: FPS: 806.86 3279.44

    So I guess I have two questions about this:

    • I still don’t understand why the DLAs run slower than the “regular CUDA GPU cores”.
      Wasn’t the DLA supposed to be specialized hardware that would accelerate the networks compared to the
      regular GPU?
    • It seems that the GPU and DLA running together is SLOWER than the GPU alone? How could that be?



If you want to use the two DLAs at the same time, you should assign index 0 and index 1.

./test -1 0 1

The crash may be caused by running out of resources on DLA0.

DLA is a hardware engine. The clock and design can be found here:

You can also refer to this introduction. The DLA’s TOPS and capacity are smaller than the GPU’s.

Since DLA is a hardware-based engine, not all TensorRT operations can be deployed on it.
Unsupported operations will fall back to the GPU.
As a result, the GPU may run slower, since part of its resources is occupied.



Thanks a lot for the answer!

A small followup: In the Jetson link you’ve said it says
“The Jetson AGX Xavier integrated Volta GPU, shown in figure 3, provides 512 CUDA cores and 64 Tensor Cores for up to 11 TFLOPS FP16”

In AnandTech (here: https://www.anandtech.com/show/13584/nvidia-xavier-agx-hands-on-carmel-and-more) it says:
"This augments the total processing power of the GPU by up to 22.6 8-bit TOPs or 11.3 FP16 TOPS on the part of the Tensor cores, on top of the respectively 2.8 and 1.4 TFLOPs for FP16 and FP32 CUDA operations provided by the SMs.

NVIDIA’s DLA promises performances of up to 11.4 int8 TOPS and is also capable of FP16 operation at half speed at 5.7 TOPS"

According to AnandTech, and if possible, I would be able to get 11 FP16 TFLOPS from the Tensor Cores while getting another 5.7 from each of the DLAs, and still have some headroom in the regular CUDA cores?

I understand it is highly unlikely to achieve this peak performance, but generally speaking, should I be able to get close to, say, 11 + 5.7 + 5.7 FP16 TFLOPS on the Xavier?

Which one of them is true?


Sorry for the late update.

The 11 TFLOPS figure uses all the GPU cores, including the 512 CUDA cores and 64 Tensor Cores.
Each DLA can reach 2.5 TFLOPS.

It is possible to use all the processors at the same time.
You can check our GitHub for an example of how to do so:



Thanks a lot again :)

Can you please specify the breakdown of TFLOPS between the CUDA cores and the Tensor Cores?
Knowing that, I would be able to estimate how many more FLOPS I can get from the hardware if my code currently runs only on the CUDA cores, right?

As for the DLA, you wrote “Each DLA can reach TFLOPS”. Did you mean each can reach 1 TFLOPS? Something else?

many thanks again


The score is measured together.
In general, we don’t force an operation to run on a CUDA core or a Tensor Core.
The GPU scheduler will do the assignment for you.

Not sure.
For a GPU task, the scheduler will make the optimal assignment.
If your job can run on a Tensor Core, it should be assigned to a Tensor Core already.

Sorry for the typo. It should be 2.5 TFLOPS.
More detail can be found here:



Hello AastaLLL,

I would like to ask you a question based on your provided code.

I profiled your code and I saw the result as in the image:

Based on the image:
After the inference (ExecutionContext::enqueue), there are cudaEventRecord and cudaEventSynchronize.
I would like to ask what the total inference execution time is. Does it include the time of cudaEventRecord and cudaEventSynchronize, or only the time of ExecutionContext::enqueue?

Thank you.

Hi @AastaLLL,
I’ve run into some issues with my project when running the DLA and GPU at the same time, so I went back to your sample.
As far as I understand, I see the same behavior in your test code.
I ran the test under nvprof in the following manner: ./test -1 --> Result is shown in gpu_only.jpg
Ran ./test -1 0 --> Result is shown in gpu_dla.jpg

I’ve instrumented the code with an empty CUDA kernel, called before and after the enqueue call.
That way I know when the GPU/DLA work starts and ends.

As can be seen in the attached images, when I run the GPU alone, it takes ~230-250 ms. When the GPU and DLA run together, it takes about ~500 ms.

The gpu_dla.jpg shows that the DLA and GPU run concurrently (at least to some degree), but it seems they block/interfere with each other. This is what I also see in my test code.

I understand that this might be due to lack of resources/network configuration/network layers etc., however the end result is that moving even a simple test to the DLA did not improve performance over running everything on the GPU alone.

Any insights would be greatly appreciated.


Hi same and ryalhir74,

Please open a new topic for your issue. Thanks


hi, @AastaLLL

I used your code and ran the test with a model that has no layers falling back to the GPU, but I find the speed on the GPU is slower when working with the DLA. Here are my screenshots:

Running only gpu.engine on the GPU with:
./test -1
the speed is:
Screen Shot 2020-08-17 at 09.51.26

But when running gpu.engine and dla.engine at the same time:
./test -1 0
Screen Shot 2020-08-17 at 09.52.54
you can see the FPS on the GPU drops from ~200 to ~180.

The speed is even slower when running two DLA models at the same time:
./test -1 0 1
Screen Shot 2020-08-17 at 09.55.27

I am completely sure the DLA model has no layers falling back to the GPU, so I cannot understand why the GPU is slower when the DLA models are running. Hoping for your reply, thanks!

I see the same thing… as if the DLA and GPU interfere with each other.

how do you solve this?

I didn’t :( Waiting for ideas/directions from NVIDIA :)

@AastaLLL can you check this?

Hi angless,

Please help to open a new topic for your issue. Thanks