Profiling DLA with GPU fallback on Jetson Xavier

Hello Nvidia,
I am trying to measure how inference time is split between the DLA and the GPU when running inference on the DLA with GPU fallback for unsupported layers/configurations.
To my knowledge, the only way to do that is to use the Nsight Systems profiler.
I am using the MobileNet-v1 example from ONNX, run with trtexec at batch size 1. I was able to get profiling reports showing DLA activity and tasks. However, I see some inconsistencies between the TensorRT timeline and the DLA timeline.
This is the Nsight report:

The first inference pass is slower, as expected, and matches the DLA timeline. For the following passes/iterations, however, the TensorRT inference duration (yellow) is much shorter than the DLA task (purple).
My questions are the following:

  1. Does the report make sense to you? What accounts for the rest of the DLA task's duration, during which no layers are executing?
  2. Concretely, I want to quantify (time_on_DLA / inference_time)% in order to compare the power efficiency of GPU-only inference versus DLA with GPU fallback. However, the DLA task duration is longer than the inference time of all layers combined. How can I estimate this?
    I have the list of layers running on each device, but I am not sure whether I should sum the TensorRT timeline duration of each layer, given the mismatch with the corresponding DLA task timeline.
    Or would you suggest another way to approach the problem? I don’t have much experience with Nsight Systems.

Here is the layer assignment list for reference:

Any help would be much appreciated.


TensorRT Version :
GPU Type : Jetson Xavier
Nvidia Driver Version : from JetPack 4.4.1
CUDA Version : 10.2.89
CUDNN Version :
Operating System + Version : Ubuntu 18.04
Python Version (if applicable) : Python 3.6
TensorFlow Version (if applicable) :
PyTorch Version (if applicable) :
Baremetal or Container (if container which image + tag) : Baremetal

No feedback yet?

Moving this from the Nsight Systems to the Jetson Xavier forum.

Hi nv-u,

Sorry for the late response. Have you managed to resolve the issue, or do you still need help?


  1. One possible reason is that the DLA's idle time is short.
    Since the DLA is independent hardware, consecutive tasks show up as one combined duration if the profiler doesn’t catch the idle period between them.

  2. Alternatively, you can measure the fallback part instead.
    For example, you may find that GPU utilization is 20% when running the engine on the DLA.
    You can then roughly say that (time_on_DLA / inference_time)% is 80%.
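
As a rough illustration of point 2, here is a minimal Python sketch of that arithmetic. It assumes you have already exported the GPU kernel start/end times for one inference from the profiler by some means. The function names (`merged_busy_time`, `dla_share`) and the interval values are hypothetical, and the sketch treats everything that is not GPU kernel time as DLA time, which ignores idle gaps and memory copies.

```python
def merged_busy_time(intervals):
    """Total time covered by (start, end) intervals, counting overlaps once."""
    total = 0.0
    cur_start, cur_end = None, None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # Close the previous merged interval before opening a new one.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or touching interval: extend the current one.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


def dla_share(gpu_intervals, inference_start, inference_end):
    """Estimate the fraction of one inference NOT spent in GPU kernels.

    Assumption: time outside GPU kernels is attributed to the DLA.
    """
    window = inference_end - inference_start
    if window <= 0:
        raise ValueError("inference window must have positive length")
    # Keep only the parts of GPU kernels that fall inside the inference window.
    clipped = [
        (max(s, inference_start), min(e, inference_end))
        for s, e in gpu_intervals
        if e > inference_start and s < inference_end
    ]
    gpu_fraction = merged_busy_time(clipped) / window
    return 1.0 - gpu_fraction


# Hypothetical example: a 10 ms inference with 2 ms of (partly overlapping) GPU kernels.
gpu_kernels_ms = [(0.0, 1.0), (0.5, 1.5), (8.0, 8.5)]
share = dla_share(gpu_kernels_ms, 0.0, 10.0)
print("estimated DLA share: {:.0%}".format(share))
```

With these made-up numbers the GPU is busy for 20% of the window, so the estimated DLA share comes out at about 80%, matching the example above.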


Hi kayccc, I think I have solved this issue. I forgot to post it here, so sorry!

ExecutionContext::Enqueue is an asynchronous interface: it returns immediately after pushing tasks into the internal queue. So the inference_time is not the duration of this call.
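
To show why timing an asynchronous call underestimates the real work, here is a small plain-Python analogy. It does not use TensorRT at all: `fake_inference` and `time_async_call` are hypothetical names, a thread pool stands in for the internal queue, and `future.result()` plays the role of waiting for the queued work to finish.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_inference(duration_s):
    """Stand-in for device work; it only sleeps for the given duration."""
    time.sleep(duration_s)
    return duration_s


def time_async_call(duration_s=0.2):
    """Return (time spent in the submit call, time until the work finished)."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        t0 = time.perf_counter()
        # submit() returns as soon as the task is queued, like an async enqueue.
        future = pool.submit(fake_inference, duration_s)
        t_enqueue = time.perf_counter() - t0
        # Blocking on the result is the analogue of synchronizing before
        # reading the end timestamp.
        future.result()
        t_total = time.perf_counter() - t0
    return t_enqueue, t_total


t_enqueue, t_total = time_async_call()
print("enqueue returned after {:.4f} s".format(t_enqueue))
print("work finished after   {:.4f} s".format(t_total))
```

The first number is only the cost of queuing the task; the second includes the work itself. In the same way, the span shown for the enqueue call in the profiler would be expected to be shorter than the DLA task it launches.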

Also thanks to AastaLLL for the informative comment.

But I actually have another unsolved issue about how to extract the data I care about from the SQLite DB; please see here. I would greatly appreciate it if anyone could reply to it.