--useDLACore + --allowGPUFallback is significantly slower

Please provide the following info (check/uncheck the boxes after creating this topic):
Software Version
DRIVE OS Linux 5.2.6
DRIVE OS Linux 5.2.6 and DriveWorks 4.0
DRIVE OS Linux 5.2.0
DRIVE OS Linux 5.2.0 and DriveWorks 3.5
NVIDIA DRIVE™ Software 10.0 (Linux)
NVIDIA DRIVE™ Software 9.0 (Linux)
other DRIVE OS version
other

Target Operating System
Linux
QNX
other

Hardware Platform
NVIDIA DRIVE™ AGX Xavier DevKit (E3550)
NVIDIA DRIVE™ AGX Pegasus DevKit (E3550)
other

SDK Manager Version
1.7.0.8846
other

Host Machine Version
native Ubuntu 18.04
other

I am trying to run an ONNX model with trtexec using --useDLACore together with --allowGPUFallback, and I see a significant slowdown. When I dumped the per-layer profile, it appears that the graph is being split between the DLA and the GPU, with frequent data transfers between the two.

In particular, I found that layer 540 in the attached screenshot is the cause of the slowdown:

Digging further into the verbose log, I see that the layer in question is a Concat operator over 4 different tensors.

In order to perform the concat, the compiler appears to be reformatting the inputs. Is this what is going on? How can I visualize this graph, and why is the tensor getting broken down into smaller tensors?
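If it matters, the Concat node and its 4 input tensors can be listed directly from the ONNX graph with the onnx Python package (a quick sketch; "model.onnx" stands in for the actual file):

```python
import onnx

# "model.onnx" is a placeholder for the actual model file.
model = onnx.load("model.onnx")

# Print every Concat node together with the names of its input tensors,
# to confirm which tensors feed the layer seen in the verbose log.
for node in model.graph.node:
    if node.op_type == "Concat":
        print(node.name or "<unnamed>", "inputs:", list(node.input))
```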

Dear @user3705,
Just to clarify: when GPU fallback is enabled, layers that the DLA does not support fall back to the GPU. This requires data transfers between the DLA and the GPU to share intermediate output buffers, which is what you are seeing in the profile. If you want to run the whole model on the DLA, you need to change the model so that it contains only DLA-supported layers. Please attach your model so we can provide more insight.
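If it helps, the same placement behaviour can be reproduced outside trtexec with the TensorRT Python API (a minimal sketch, assuming the TensorRT Python bindings shipped with your DRIVE OS release; the model path and DLA core index are placeholders):

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.VERBOSE)

# Parse the ONNX model into a TensorRT network.
builder = trt.Builder(TRT_LOGGER)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, TRT_LOGGER)
with open("model.onnx", "rb") as f:            # placeholder path
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

# Roughly equivalent to trtexec --useDLACore=0 --allowGPUFallback --fp16:
# layers the DLA cannot run fall back to the GPU, which introduces the
# DLA<->GPU reformat/copy nodes visible in the per-layer profile.
config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)          # DLA runs only FP16/INT8
config.set_flag(trt.BuilderFlag.GPU_FALLBACK)
config.default_device_type = trt.DeviceType.DLA
config.DLA_core = 0                            # placeholder core index

engine = builder.build_engine(network, config)
```

With the verbose logger, the build log reports which layers are placed on the DLA and which fall back to the GPU, matching what trtexec --verbose shows.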

Hi,

Here is the breakdown of layers running on the DLA vs. the GPU. Why would common layers like Conv, Relu and Pooling fail to run on the DLA and get assigned to the GPU? Could this be due to DLA resource constraints? What specific constraints prevent the DLA from running these layers?

Dear @user3705,
Could you please check the “DLA Supported Layers -> Layer Specific Restrictions” section in the TensorRT Developer Guide? It lists the per-layer constraints that cause layers such as Conv, Relu and Pooling to be rejected by the DLA and fall back to the GPU.
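In addition to the documentation, per-layer DLA eligibility can be queried programmatically via IBuilderConfig::canRunOnDLA. A minimal sketch with the Python bindings (the model path and DLA core index are placeholders); only layers that pass this check are eligible for the DLA, and everything else falls back to the GPU when GPU fallback is enabled:

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

builder = trt.Builder(TRT_LOGGER)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, TRT_LOGGER)
with open("model.onnx", "rb") as f:            # placeholder path
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)          # DLA supports only FP16/INT8 precision
config.default_device_type = trt.DeviceType.DLA
config.DLA_core = 0                            # placeholder core index

# List each layer and whether it satisfies the DLA layer-specific restrictions.
for i in range(network.num_layers):
    layer = network.get_layer(i)
    placement = "DLA" if config.can_run_on_DLA(layer) else "GPU (restricted)"
    print(f"{layer.name:60s} {str(layer.type):30s} -> {placement}")
```

This should show exactly which of the Conv, Relu and Pooling layers violate a layer-specific restriction in your model.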