Compute time in DLA slower than expected


Hi, we tried to run inference with a pretrained ResNet-50 model on the DLA of a Jetson AGX Orin using this code: GitHub - NVIDIA-AI-IOT/jetson_dla_tutorial: A tutorial for getting started with the Deep Learning Accelerator (DLA) on NVIDIA Jetson.

The pretrained ResNet-50 model comes from the torchvision.models module. We followed the same steps as the tutorial for creating the TRT builder, building the TRT engine, allocating the I/O buffers, and running inference with the execute_async_v2 function.
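For context, a minimal sketch of that workflow with the TensorRT 8.x Python API is shown below. This is illustrative only and assumes an ONNX export of the model; the helper name build_dla_engine and the onnx_path parameter are not from the tutorial, and the tutorial's actual code may differ in details.

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)

def build_dla_engine(onnx_path, dla_core=0):
    """Sketch: build a serialized TensorRT engine targeting the DLA."""
    builder = trt.Builder(logger)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, logger)
    with open(onnx_path, "rb") as f:
        parser.parse(f.read())

    config = builder.create_builder_config()
    config.set_flag(trt.BuilderFlag.FP16)            # DLA requires FP16 or INT8
    config.default_device_type = trt.DeviceType.DLA  # place layers on the DLA
    config.DLA_core = dla_core
    # Layers the DLA cannot run fall back to the GPU instead of failing the build.
    config.set_flag(trt.BuilderFlag.GPU_FALLBACK)

    return builder.build_serialized_network(network, config)
```

Inference then deserializes the engine and calls context.execute_async_v2(bindings, stream_handle) on an execution context, as in the tutorial.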

However, the slowdown in compute time on the DLA, compared to the GPU, is much larger than expected.

For batch_size = 1, the compute time on the GPU is 1.2 ms while on the DLA it is 19 ms, about a 15x degradation.

Similarly, for batch_size = 16, the compute time on the GPU is 8 ms while on the DLA it is 333 ms, about a 40x degradation.
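For reference, the slowdown factors follow directly from the measured compute times (roughly 16x and 42x, consistent with the 15x and 40x figures quoted above):

```python
# Slowdown factor = DLA compute time / GPU compute time, per batch size.
gpu_ms = {1: 1.2, 16: 8.0}     # batch_size -> GPU compute time (ms)
dla_ms = {1: 19.0, 16: 333.0}  # batch_size -> DLA compute time (ms)

for bs in gpu_ms:
    slowdown = dla_ms[bs] / gpu_ms[bs]
    print(f"batch_size={bs}: {slowdown:.1f}x slower on DLA")
```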

However, recent papers report only a 3-5x degradation on the DLA compared to the GPU. Could you please explain why we are observing such a large degradation on the DLA?


Device : Jetson AGX Orin
TensorRT Version: 8.4.0
Nvidia Driver Version: V11.4.239
CUDA Version: 11.4
JetPack Version: 5.0.1-b118
Python Version (if applicable): 3.8.10
Baremetal or Container (if container which image + tag): Baremetal

Relevant Files

Please check the links below, as they might answer your questions.


Hi @AakankshaS thanks for the reply.
We have already gone through the documentation and the related links you’ve shared, but none of them answers our question about what is causing the slowdown on the DLA. Please forward this query to the relevant team.

Thanks !


We are moving this post to the Jetson AGX Orin forum to get better help.

Thank you.


Have you maximized the device performance first?

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

Also, could you try it in INT8 mode?
Orin’s DLA is expected to run slower in FP16 mode by design.
More details can be found in the topic below:
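With the TensorRT Python API, requesting INT8 for the DLA is a small change to the builder configuration. The sketch below is hedged: the helper name make_dla_int8_config is illustrative, it assumes a builder was already created, and a real INT8 build additionally needs calibration data (an IInt8Calibrator) or explicit per-tensor dynamic ranges, which are omitted here.

```python
import tensorrt as trt

def make_dla_int8_config(builder, dla_core=0):
    """Sketch: builder config targeting the DLA in INT8 mode."""
    config = builder.create_builder_config()
    config.set_flag(trt.BuilderFlag.INT8)            # run DLA-eligible layers in INT8
    config.default_device_type = trt.DeviceType.DLA
    config.DLA_core = dla_core
    config.set_flag(trt.BuilderFlag.GPU_FALLBACK)    # unsupported layers fall back to GPU
    # In a real build, also set config.int8_calibrator, or assign
    # per-tensor dynamic ranges on the network tensors.
    return config
```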

