Compute time in DLA slower than expected


Hi, we tried to run a pretrained ResNet50 inference model on the DLA of a Jetson AGX Orin using the code from the NVIDIA-AI-IOT/jetson_dla_tutorial GitHub repository (a tutorial for getting started with the Deep Learning Accelerator (DLA) on NVIDIA Jetson).

The pretrained ResNet50 model comes from the torchvision.models module. We followed the same format for creating the TRT builder, building the TRT engine, creating the I/O buffers, and running inference with the execute_async_v2 function.

However, the slowdown in compute time on the DLA relative to the GPU is much larger than we expected.

For batch_size = 1, the compute time is 1.2 ms on the GPU and 19 ms on the DLA, a slowdown of roughly 16x.

Similarly, for batch_size = 16, the compute time is 8 ms on the GPU and 333 ms on the DLA, a slowdown of roughly 42x.

However, recent papers report only a 3-5x slowdown on the DLA compared to the GPU. Could you please explain why we are observing such a large slowdown on the DLA?
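
The ratios quoted above can be checked directly from the four reported timings. The snippet below is only a sanity check on that arithmetic; the helper names `slowdown` and `images_per_second` are made up for illustration and are not part of the tutorial code.

```python
# Latencies reported in this post, in milliseconds:
# {batch_size: (gpu_ms, dla_ms)}
reported_ms = {1: (1.2, 19.0), 16: (8.0, 333.0)}


def slowdown(gpu_ms, dla_ms):
    """How many times slower the DLA run is than the GPU run."""
    return dla_ms / gpu_ms


def images_per_second(batch_size, latency_ms):
    """Throughput implied by one batch taking latency_ms milliseconds."""
    return batch_size * 1000.0 / latency_ms


for batch_size, (gpu_ms, dla_ms) in reported_ms.items():
    print(
        f"batch={batch_size:2d}  "
        f"slowdown={slowdown(gpu_ms, dla_ms):5.1f}x  "
        f"GPU={images_per_second(batch_size, gpu_ms):7.1f} img/s  "
        f"DLA={images_per_second(batch_size, dla_ms):5.1f} img/s"
    )
```

With these numbers the slowdown works out to about 15.8x at batch size 1 and 41.6x at batch size 16. The implied DLA throughput stays near 50 images per second at both batch sizes, while the GPU throughput rises from about 830 to 2000 images per second, so most of the widening gap comes from the GPU benefiting from batching.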


Device : Jetson AGX Orin
TensorRT Version: 8.4.0
Nvidia Driver Version: V11.4.239
CUDA Version: 11.4
Jetpack + Version: 5.0.1-b118
Python Version (if applicable): 3.8.10
Baremetal or Container (if container which image + tag): Baremetal

Hi @AakankshaS thanks for the reply.
We have already gone through the documentation and the related links that you shared, but none of them explains what is causing the slowdown on the DLA. Could you please forward this query to the relevant team?

Thanks !


Have you maximized the device performance first?

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

Also, could you try INT8 mode?
Orin’s DLA is expected to run slower in FP16 mode by design.
More details can be found in the topic below:
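
For the INT8 suggestion above, one quick way to compare precisions is to benchmark the same model with trtexec. The sketch below only assembles the command line. The helper name `build_trtexec_args` and the file name `resnet50.onnx` are assumptions for illustration, and the model would first have to be exported to ONNX. Without a calibration cache, an INT8 engine built this way is suitable for timing only, not for checking accuracy.

```python
import shlex


def build_trtexec_args(onnx_path, dla_core=0, precision="int8", gpu_fallback=True):
    """Return a trtexec argument list that targets one DLA core.

    onnx_path is a hypothetical path to an exported ResNet50 ONNX file.
    """
    # The DLA runs only reduced-precision engines, so reject anything else.
    if precision not in ("int8", "fp16"):
        raise ValueError("precision must be 'int8' or 'fp16' for DLA")
    args = [
        "trtexec",
        f"--onnx={onnx_path}",
        f"--useDLACore={dla_core}",
        f"--{precision}",
    ]
    if gpu_fallback:
        # Let layers the DLA cannot run fall back to the GPU.
        args.append("--allowGPUFallback")
    return args


args = build_trtexec_args("resnet50.onnx")
print(shlex.join(args))
```

On the Jetson itself the list could be passed to `subprocess.run(args, check=True)`. Running it once with `precision="int8"` and once with `precision="fp16"` gives a like-for-like comparison on the same DLA core.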


