--useDLACore + --allowGPUFallback is significantly slower

Please provide the following info (check/uncheck the boxes after creating this topic):
Software Version
DRIVE OS Linux 5.2.6
DRIVE OS Linux 5.2.6 and DriveWorks 4.0
DRIVE OS Linux 5.2.0
DRIVE OS Linux 5.2.0 and DriveWorks 3.5
NVIDIA DRIVE™ Software 10.0 (Linux)
NVIDIA DRIVE™ Software 9.0 (Linux)
other DRIVE OS version
other

Target Operating System
Linux
QNX
other

Hardware Platform
NVIDIA DRIVE™ AGX Xavier DevKit (E3550)
NVIDIA DRIVE™ AGX Pegasus DevKit (E3550)
other

SDK Manager Version
1.7.0.8846
other

Host Machine Version
native Ubuntu 18.04
other

I am trying to run an ONNX model with trtexec using --useDLACore together with --allowGPUFallback, and I see a significant slowdown. When I dumped the per-layer profile, it appears that the graph is being split between the DLA and the GPU, with frequent data transfers between the two.

In particular, I found that layer 540 in the attached screenshot is the cause of the slowdown:

Digging further into the verbose log, I see that the layer in question is a Concat operator over 4 different tensors.

In order to perform the concat, the compiler appears to be reformatting the inputs. Is this what is going on? How can I visualize this graph, and why is the tensor getting broken down into smaller tensors?
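If it matters, the Concat node and its 4 input tensors can be listed directly from the ONNX graph with the onnx Python package (a quick sketch; "model.onnx" stands in for the actual file):

```python
import onnx

# "model.onnx" is a placeholder for the actual model file.
model = onnx.load("model.onnx")

# Print every Concat node together with the names of its input tensors,
# to confirm which tensors feed the layer seen in the verbose log.
for node in model.graph.node:
    if node.op_type == "Concat":
        print(node.name or "<unnamed>", "inputs:", list(node.input))
```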

Dear @user3705,
Just to clarify: when GPU fallback is enabled, layers that the DLA does not support fall back to the GPU. This requires data transfers between the DLA and the GPU to share intermediate output buffers, which is what you are seeing in the profile. If you want to run the whole model on the DLA, you need to change the model so that it contains only DLA-supported layers. Please attach your model so we can provide more insight.
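If it helps, the same placement behaviour can be reproduced outside trtexec with the TensorRT Python API (a minimal sketch, assuming the TensorRT Python bindings shipped with your DRIVE OS release; the model path and DLA core index are placeholders):

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.VERBOSE)

# Parse the ONNX model into a TensorRT network.
builder = trt.Builder(TRT_LOGGER)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, TRT_LOGGER)
with open("model.onnx", "rb") as f:            # placeholder path
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

# Roughly equivalent to trtexec --useDLACore=0 --allowGPUFallback --fp16:
# layers the DLA cannot run fall back to the GPU, which introduces the
# DLA<->GPU reformat/copy nodes visible in the per-layer profile.
config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)          # DLA runs only FP16/INT8
config.set_flag(trt.BuilderFlag.GPU_FALLBACK)
config.default_device_type = trt.DeviceType.DLA
config.DLA_core = 0                            # placeholder core index

engine = builder.build_engine(network, config)
```

With the verbose logger, the build log reports which layers are placed on the DLA and which fall back to the GPU, matching what trtexec --verbose shows.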

Hi,

Here is the breakdown of layers running on the DLA vs. the GPU. Why would common layers like Conv, Relu and Pooling fail to run on the DLA and get assigned to the GPU? Could this be due to DLA resource constraints? What specific constraints prevent the DLA from running these layers?

Dear @user3705,
Could you please check the “DLA Supported Layers -> Layer Specific Restrictions” section in the TensorRT Developer Guide? It lists the per-layer constraints that cause layers such as Conv, Relu and Pooling to be rejected by the DLA and fall back to the GPU.
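In addition to the documentation, per-layer DLA eligibility can be queried programmatically via IBuilderConfig::canRunOnDLA. A minimal sketch with the Python bindings (the model path and DLA core index are placeholders); only layers that pass this check are eligible for the DLA, and everything else falls back to the GPU when GPU fallback is enabled:

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

builder = trt.Builder(TRT_LOGGER)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, TRT_LOGGER)
with open("model.onnx", "rb") as f:            # placeholder path
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)          # DLA supports only FP16/INT8 precision
config.default_device_type = trt.DeviceType.DLA
config.DLA_core = 0                            # placeholder core index

# List each layer and whether it satisfies the DLA layer-specific restrictions.
for i in range(network.num_layers):
    layer = network.get_layer(i)
    placement = "DLA" if config.can_run_on_DLA(layer) else "GPU (restricted)"
    print(f"{layer.name:60s} {str(layer.type):30s} -> {placement}")
```

This should show exactly which of the Conv, Relu and Pooling layers violate a layer-specific restriction in your model.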