TensorRT verbose log problem about GPU Compute time

Description

I built some models to test run time.
The models differ only in whether they have one input or two.
In the first two groups, the dual_input.onnx model takes about twice the GPU compute time of the single_input.onnx model.

./trtexec --onnx=/home/nvidia/input.onnx --explicitBatch --verbose --workspace=2048

Input size is (3, 192, 256) in CHW order.
Output size is (6400, 48, 64).

single_input6400.onnx, GPU compute mean: 9.07 ms
dual_input6400.onnx, GPU compute mean: 18.20 ms

Input size is (3, 192, 256) in CHW order.
Output size is (640, 48, 64).

single_input640.onnx, GPU compute mean: 0.88 ms
dual_input640.onnx, GPU compute mean: 1.75 ms

This group shows the same roughly two-times difference.
Input size is (3, 192, 256) in CHW order.
Output size is (64, 48, 64).

single_input64.onnx, GPU compute mean: 0.11 ms
dual_input64.onnx, GPU compute mean: 0.22 ms
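The dual/single ratio for each group can be checked directly from the numbers above (a quick sanity check, using only the measured means):

```python
# GPU compute means reported above: (single-input, dual-input) in ms.
times = {
    "6400": (9.07, 18.20),
    "640":  (0.88, 1.75),
    "64":   (0.11, 0.22),
}

# Print the dual/single ratio for each output-size group.
for name, (single, dual) in times.items():
    print(f"{name}: {dual / single:.2f}x")   # → 2.01x, 1.99x, 2.00x
```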

According to the Horizontal Layer Fusion documentation, the model with two inputs should have its first conv layers fused horizontally.

But the measured GPU compute times show no improvement from this in the test results.
Does a horizontally merged convolution layer affect the run time?
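For intuition, horizontal fusion of two convolutions that share the same input amounts to stacking their filters along the output-channel axis and launching one kernel instead of two. A minimal numpy sketch with 1x1 convolutions (illustrative shapes only, not taken from the attached models):

```python
import numpy as np

rng = np.random.default_rng(0)
cin, hw = 3, 16
x = rng.standard_normal((cin, hw))     # input, spatial dims flattened

w1 = rng.standard_normal((8, cin))     # branch 1: 8 output channels
w2 = rng.standard_normal((8, cin))     # branch 2: 8 output channels

# Unfused: two separate "kernel launches".
y1 = w1 @ x
y2 = w2 @ x

# Fused: one launch with the filters concatenated, then a split of the result.
wf = np.concatenate([w1, w2], axis=0)
yf = wf @ x
y1f, y2f = yf[:8], yf[8:]

print(np.allclose(y1, y1f) and np.allclose(y2, y2f))   # prints True
```

The fused version does the same arithmetic in one pass, which is where the expected speedup would come from.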

All models are attached in the Relevant Files section.

Thank you

Environment

TensorRT Version : 7.1.3
GPU Type : Xavier
Nvidia Driver Version : Package:nvidia-jetpack, Version: 4.4.1-b50
CUDA Version : 10.2.89
CUDNN Version : 8.0.0
Operating System + Version : Ubuntu 18.04
Python Version (if applicable) :
TensorFlow Version (if applicable) :
PyTorch Version (if applicable) :
Baremetal or Container (if container which image + tag) :

Relevant Files

test.rar (2.4 MB)

Steps To Reproduce

Please include:

  • Exact steps/commands to build your repro
  • Exact steps/commands to run your repro
  • Full traceback of errors encountered

Hi, could you please share the model, script, profiler, and performance output so that we can help you better?

Alternatively, you can try running your model with the trtexec command
https://github.com/NVIDIA/TensorRT/tree/master/samples/opensource/trtexec
or view these tips for optimizing performance
https://docs.nvidia.com/deeplearning/tensorrt/best-practices/index.html

Thanks!

Thank you for the reply.

I have attached all the models in Relevant Files and rewritten the post.

Thanks for help on this.

Hi @disculus2012,

Horizontal fusion is only applied if the outputs of the convolutions are not marked as network outputs.
This is because if they are marked as outputs, we need two additional copies to transform the concatenated convolution output back into two network outputs, which costs more time than skipping the horizontal fusion.
To really test the horizontal fusion, we could concatenate the two convolution outputs along the channel dimension and add an additional convolution after that.

Thank you.