Performance Bottleneck in TensorRT Inference on Jetson with Semantic Segmentation Model (DWConv)

Description

I am currently working on deploying a semantic segmentation model from PyTorch to TensorRT for inference on Jetson. After exporting the model to .onnx and optimizing it with TensorRT, I have run into a performance bottleneck in a specific block of the model's backbone. There are two instances of this block, and each takes roughly 8 ms during inference.
The full network (574 TensorRT layers) has an average latency of 76 ms, so these two blocks alone account for about 21% of the total inference time.
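
In case it helps reproduce the measurement, here is a minimal sketch (engine deserialization and buffer setup omitted, all names are placeholders) of how per-layer latency can be collected with the TensorRT Python API; `trtexec --dumpProfile` gives a similar breakdown.

```python
# Minimal per-layer profiling sketch, assuming an already-deserialized TensorRT engine.
import tensorrt as trt

class LayerProfiler(trt.IProfiler):
    """Accumulates the per-layer times TensorRT reports after each execution."""
    def __init__(self):
        trt.IProfiler.__init__(self)
        self.timings = {}

    def report_layer_time(self, layer_name, ms):
        # Called by TensorRT once per layer per inference.
        self.timings[layer_name] = self.timings.get(layer_name, 0.0) + ms

# Usage (engine creation and binding allocation omitted):
# profiler = LayerProfiler()
# context = engine.create_execution_context()
# context.profiler = profiler
# for _ in range(100):
#     context.execute_v2(bindings)   # synchronous execution triggers the profiler callback
# for name, total in sorted(profiler.timings.items(), key=lambda kv: kv[1], reverse=True)[:10]:
#     print(f"{name}: {total / 100:.3f} ms per run")
```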

Does anyone have any hints?

Environment

TensorRT Version: 7.1.3
GPU Type: Jetson AGX Xavier
Nvidia Driver Version:
CUDA Version: 10.2.89
CUDNN Version: 8.0.0.180
Operating System + Version: Ubuntu 18.04
Python Version (if applicable):
TensorFlow Version (if applicable):
PyTorch Version (if applicable): 1.13.1
Baremetal or Container (if container which image + tag): nvcr.io/nvidia/l4t-ml:r32.7.1-py3

Relevant Files

This is the block with the related I/O shapes:
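
For reference, below is a PyTorch sketch of a typical depthwise-separable (DWConv) block of the kind named in the title; the channel counts, stride, and shapes are placeholders rather than the actual values from the backbone.

```python
# Illustrative DWConv block; all dimensions are placeholders.
import torch
import torch.nn as nn

class DWConvBlock(nn.Module):
    def __init__(self, in_ch=256, out_ch=256, stride=1):
        super().__init__()
        # Depthwise 3x3 conv: groups == in_ch, i.e. one filter per input channel.
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, stride, padding=1,
                                   groups=in_ch, bias=False)
        self.bn1 = nn.BatchNorm2d(in_ch)
        # Pointwise 1x1 conv mixes channels after the spatial filtering.
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.act(self.bn1(self.depthwise(x)))
        return self.act(self.bn2(self.pointwise(x)))

# x = torch.randn(1, 256, 64, 128)   # placeholder input
# DWConvBlock()(x).shape             # -> torch.Size([1, 256, 64, 128])
```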

Hi @user3618,
Apologies for the delayed response. Could you please share the model and a repro script so we can try to reproduce the issue?
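
For the repro, a minimal script that exports just the suspect block to ONNX and profiles it with trtexec is usually enough; a sketch, with the block definition and input shape as placeholders:

```python
# Hypothetical repro sketch: export the suspect block in isolation to ONNX so it
# can be timed on its own. The block and input shape below are placeholders.
import torch
import torch.nn as nn

block = nn.Sequential(                        # stand-in for the real DWConv block
    nn.Conv2d(256, 256, 3, padding=1, groups=256, bias=False),
    nn.BatchNorm2d(256),
    nn.ReLU(inplace=True),
    nn.Conv2d(256, 256, 1, bias=False),
    nn.BatchNorm2d(256),
    nn.ReLU(inplace=True),
).eval()

dummy = torch.randn(1, 256, 64, 128)          # placeholder (N, C, H, W)

torch.onnx.export(block, dummy, "dwconv_block.onnx", opset_version=11,
                  input_names=["input"], output_names=["output"])

# Then on the Jetson, for example:
#   trtexec --onnx=dwconv_block.onnx --fp16 --dumpProfile
```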