TensorRT inference significantly slower using kernel size 11 vs 3

rolandkajatin · February 9, 2021, 9:39am

Description

I have a TensorRT optimized engine with some convolutional layers at the end. Using a kernel size of 11 (in/out channels 128) takes significantly more time than using a kernel size of 3 (in/out channels 128). I am running on a Jetson Xavier NX board with JetPack 4.5.

I have attached the two log files with profiling at the end. The layers in question are called Conv_23 + Relu_24 and Conv_25 + Relu_26. The total inference time is about 40 ms with kernel size 11, compared to about 2 ms with kernel size 3.

Am I doing something wrong? I train my model using PyTorch; measuring inference time in Python gives roughly the same result (~2 ms) for both the larger and smaller kernel sizes.

Environment

TensorRT Version: 7.1.3-1
Platform: Nvidia Jetson Xavier NX
JetPack Version: 4.5
L4T Version: 32.5
CUDA Version: 10.2.89
CUDNN Version: 8.0.0
Operating System + Version: Ubuntu 18.04.5 LTS

Relevant Files

Link to the log files.

NVES · February 9, 2021, 2:57pm

Hi, please share the model, script, profiler output, and performance output so that we can help you better.

Alternatively, you can try running your model with the trtexec command
https://github.com/NVIDIA/TensorRT/tree/master/samples/opensource/trtexec
or view these tips for optimizing performance
https://docs.nvidia.com/deeplearning/tensorrt/best-practices/index.html

Thanks!

rolandkajatin · February 10, 2021, 7:33am

Hi,

Thank you for your reply.

I attached the log files created by trtexec to my original post. You can see the network architecture there, as well as the execution profile dump at the end of each log file.

My concern is that one single change makes such a huge difference in the execution times of two convolutional layers. As I said in my original post, using a kernel size of 11 instead of 3 raises the latency from 2 ms to 40 ms. I do not understand why a kernel size of 11 would be such a burden for the TensorRT optimized engine.

spolisetty · February 10, 2021, 9:31am

Hi @rolandkajatin,

With all other Conv parameters the same, kernel size 11 requires about 13.4x the computation of kernel size 3 (11 × 11 = 121 multiplications per output element and channel pair versus 3 × 3 = 9). So a longer running time for kernel size 11 should be expected in general (though we need to investigate in detail).
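As a rough check of the 13.4x figure, here is a minimal sketch that counts multiply-accumulate operations for a standard 2D convolution. The 128 in/out channels come from the original post; the output size of 64 x 64 is a hypothetical placeholder (it cancels out in the ratio), and `conv2d_macs` is an illustrative helper, not part of any library.

```python
def conv2d_macs(kernel_size, in_channels, out_channels, out_h, out_w):
    """Multiply-accumulate count for a dense 2D convolution (no groups, bias ignored)."""
    per_output_element = kernel_size * kernel_size * in_channels
    return per_output_element * out_channels * out_h * out_w


# Hypothetical output resolution; the real value depends on the network.
OUT_H, OUT_W = 64, 64

macs_k3 = conv2d_macs(3, 128, 128, OUT_H, OUT_W)
macs_k11 = conv2d_macs(11, 128, 128, OUT_H, OUT_W)

# The ratio depends only on the kernel sizes: 121 / 9.
ratio = macs_k11 / macs_k3
print(f"kernel 3:  {macs_k3:,} MACs")
print(f"kernel 11: {macs_k11:,} MACs")
print(f"ratio:     {ratio:.1f}x")
```

This only counts arithmetic. Actual latency also depends on which convolution algorithm TensorRT selects for each kernel size, so the measured slowdown does not have to match this ratio.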

Hope this issue is not a blocker for you.
If it is, please share the ONNX model and the steps to reproduce the issue so that we can debug it.

Thank you.