Profiling DLA with GPU fallback on Jetson Xavier

youcef4tak · February 4, 2021, 1:36pm

Hello Nvidia,
I am trying to measure how inference time is split between the DLA and the GPU when a model runs on the DLA with GPU fallback for unsupported layers/configurations.
To my knowledge, the only way to do that is with the Nsight Systems profiler.
I am using the Mobilenet-v1 example from ONNX, run through trtexec with batchsize=1. I was able to get profiling reports that show DLA activity and tasks. However, I see some inconsistencies between the TensorRT timeline and the DLA timeline.
This is the Nsight report:

The first inference pass is slower, as expected, and matches the DLA timeline. For the following passes/iterations, however, the TensorRT inference duration (yellow) is much shorter than the DLA task (purple).
My questions are the following:

Does the report make sense to you, and what accounts for the remainder of the DLA task duration, during which no layers are executing?
Concretely, I want to quantify (time_on_DLA / inference_time)% in order to compare the power efficiency of GPU only vs. DLA with GPU fallback. However, the DLA task takes longer than the inference time of all the layers together. How can I estimate this ratio?
I know which layers run on each device, but I am not sure I should count the TensorRT timeline duration for each layer, because it differs from the corresponding DLA task timeline.
Or would you suggest another way to approach the problem? I don’t have much experience with Nsight Systems.
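One way to make the (time_on_DLA / inference_time)% ratio concrete is to work from start/end timestamps rather than summed durations. The following is only a minimal sketch: the interval values are hypothetical, not taken from the report, and the helper names `merge_intervals` and `busy_fraction` are illustrative. It assumes you already have a trustworthy inference window, clips the DLA tasks to it, and counts overlapping tasks once.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start, end) in milliseconds


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping intervals so shared time is counted only once."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def busy_fraction(tasks: List[Interval], window: Interval) -> float:
    """Fraction of `window` covered by `tasks`, after clipping to the window."""
    w_start, w_end = window
    if w_end <= w_start:
        raise ValueError("window must have a positive duration")
    clipped = [(max(s, w_start), min(e, w_end)) for s, e in tasks]
    clipped = [(s, e) for s, e in clipped if e > s]  # drop tasks outside the window
    covered = sum(e - s for s, e in merge_intervals(clipped))
    return covered / (w_end - w_start)


# Hypothetical numbers for illustration only.
inference_window = (0.0, 10.0)
dla_tasks = [(1.0, 4.0), (3.0, 6.0), (9.0, 12.0)]

dla_share = busy_fraction(dla_tasks, inference_window)
print(f"time_on_DLA / inference_time = {dla_share:.0%}")
```

Clipping matters here because a DLA task that extends past the window would otherwise push the ratio above 100%.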

Here is the list of layer assignments for reference:

Any help would be much appreciated.

Environment

TensorRT Version : 7.1.3.0
GPU Type : Jetson Xavier
Nvidia Driver Version : from JetPack 4.4.1
CUDA Version : 10.2.89
CUDNN Version : 8.0.0.180
Operating System + Version : Ubuntu 18.04
Python Version (if applicable) : Python 3.6
TensorFlow Version (if applicable) :
PyTorch Version (if applicable) :
Baremetal or Container (if container which image + tag) : Baremetal

nv-u · May 31, 2021, 12:27pm

No feedback yet?

TomNVIDIA · June 14, 2021, 7:31pm

Moving this from the Nsight Systems to the Jetson Xavier forum.

kayccc · June 23, 2021, 2:15am

Hi nv-u,

Sorry for the late response. Have you managed to resolve the issue, or do you still need help?

AastaLLL · June 23, 2021, 3:14am

Hi,

One possible reason is that the DLA idle time is very short.
Since the DLA is independent hardware, consecutive tasks will show up as one combined duration if the profiler doesn’t catch the idle period between them.
As an alternative, you can measure the fallback part instead.
For example, you may find that GPU utilization is 20% when running the engine on the DLA.
In that case you can roughly say that (time_on_DLA / inference_time)% is 80%.
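The arithmetic behind this estimate can be written out as a short sketch. It assumes the GPU and the DLA never run at the same time and that GPU utilization is measured over the inference window only. The function name is illustrative, and 20% is just the example figure from above.

```python
def dla_share_from_gpu_utilization(gpu_utilization: float) -> float:
    """Rough DLA share of inference time, assuming all non-GPU time is DLA time."""
    if not 0.0 <= gpu_utilization <= 1.0:
        raise ValueError("gpu_utilization must be between 0 and 1")
    return 1.0 - gpu_utilization


gpu_utilization = 0.20  # example value, not a measurement
dla_share = dla_share_from_gpu_utilization(gpu_utilization)
print(f"time_on_DLA / inference_time is roughly {dla_share:.0%}")
```

If the GPU and DLA overlap, or the GPU is busy with other work, this estimate no longer holds.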

Thanks.

nv-u · June 23, 2021, 8:33am

Hi kayccc, I think I have solved this issue. I forgot to post it here, so sorry!

ExecutionContext::Enqueue is an asynchronous interface: it returns immediately after pushing the work into an internal queue. So the duration of this call is not the inference_time.
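To illustrate why timing an asynchronous call underestimates the real work, here is a small analogy in plain Python. It does not use TensorRT at all: `fake_inference` is a hypothetical stand-in for the queued work, `pool.submit` plays the role of the asynchronous enqueue, and `future.result()` plays the role of waiting for the work to finish.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_inference(duration_s: float) -> str:
    """Stand-in for queued work; it just sleeps for the given time."""
    time.sleep(duration_s)
    return "done"


with ThreadPoolExecutor(max_workers=1) as pool:
    t0 = time.perf_counter()
    future = pool.submit(fake_inference, 0.05)  # returns right away, like an async enqueue
    t_submit = time.perf_counter() - t0         # time spent only on queuing the work

    future.result()                             # block until the work has actually finished
    t_total = time.perf_counter() - t0          # queuing time plus execution time

print(f"submit call: {t_submit * 1e3:.2f} ms, submit + wait: {t_total * 1e3:.2f} ms")
```

Only the second measurement covers the work itself, which is why the duration of the enqueue call shown on a timeline can be much shorter than the task it launches.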

Also thanks to AastaLLL for the informative comment.

However, I have another unresolved issue about how to extract the data I am interested in from the SQLite DB; please see here. I would greatly appreciate any replies.

system · August 29, 2021, 5:54am

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.
