I am trying to measure how inference time is split between the DLA and the GPU when running inference on the DLA with GPU fallback for unsupported layers/configurations.
To my knowledge, the only way to do that is to use the Nsight Systems profiler.
I am using the MobileNet-v1 example from ONNX, run through trtexec with a batch size of 1. I was able to get profiling reports that show DLA activation and tasks. However, I see some inconsistencies between the TensorRT timeline and the DLA timeline.
This is the Nsight report:
The first inference pass is slower, as expected, and its duration matches the DLA timeline. For the following passes/iterations, however, the TensorRT inference duration (yellow) is much shorter than the DLA task (purple).
My questions are the following:
- Does the report make sense to you, and what accounts for the remainder of the DLA task time, during which no layers are executing?
- Concretely, I want to quantify (time_on_DLA / inference_time)% in order to compare the power efficiency of GPU-only inference vs. DLA with GPU fallback. However, the DLA task time is longer than the inference time of all layers combined. How can I estimate this ratio?
I know which layers run on each device, but I am not sure whether I should sum the TensorRT timeline duration of each layer, given the mismatch with the corresponding DLA task timeline.
Or would you suggest another way to approach the problem? I don’t have much experience with Nsight Systems.
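To make the metric concrete, here is a minimal sketch of the ratio I have in mind. It assumes the start/end timestamps of the DLA tasks and of one inference pass have been read off the Nsight Systems timeline by hand. The function names and all numbers are hypothetical placeholders, not values from my report, and nothing here calls an Nsight or TensorRT API.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_ms, end_ms), hypothetical units


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping intervals so shared time is counted only once."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous interval: extend it instead of appending.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def busy_time_within(intervals: List[Interval], window: Interval) -> float:
    """Total time covered by `intervals`, clipped to `window`."""
    w_start, w_end = window
    total = 0.0
    for start, end in merge_intervals(intervals):
        overlap = min(end, w_end) - max(start, w_start)
        if overlap > 0:
            total += overlap
    return total


def dla_share(dla_tasks: List[Interval], inference_window: Interval) -> float:
    """Fraction of one inference pass during which a DLA task is active."""
    duration = inference_window[1] - inference_window[0]
    if duration <= 0:
        raise ValueError("inference window must have positive duration")
    return busy_time_within(dla_tasks, inference_window) / duration


# Made-up timestamps for illustration only (not taken from my report).
example_window = (0.0, 10.0)
example_dla_tasks = [(1.0, 4.0), (3.0, 6.0), (9.0, 12.0)]
print("DLA share: {:.0%}".format(dla_share(example_dla_tasks, example_window)))
```

Clipping the DLA task to the inference window is a choice on my side: it sidesteps the case where the DLA task appears longer than the TensorRT inference span, but I do not know whether that is the right way to interpret the timeline.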
Here is the list of layer assignments for reference:
Any help would be much appreciated.
TensorRT Version : 188.8.131.52
GPU Type : Jetson Xavier
Nvidia Driver Version : from JetPack 4.4.1
CUDA Version : 10.2.89
CUDNN Version : 184.108.40.206
Operating System + Version : Ubuntu 18.04
Python Version (if applicable) : Python 3.6
TensorFlow Version (if applicable) :
PyTorch Version (if applicable) :
Baremetal or Container (if container which image + tag) : Baremetal