I have a TensorRT-optimized engine with a few convolutional layers at the end. With a kernel size of 11 (128 input/output channels) these layers take significantly more time than with a kernel size of 3 (same 128 input/output channels). I am running on a Jetson Xavier NX board with JetPack 4.5.
I have attached the two log files with per-layer profiling at the end. The layers in question are called
Conv_23 + Relu_24 and
Conv_25 + Relu_26. Total inference time goes from about 2 ms (kernel size 3) to about 40 ms (kernel size 11).
Am I doing something wrong? I train my model in PyTorch; measuring inference time in Python gives roughly the same result (~2 ms) for both kernel sizes.
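For context, this is roughly how I measure inference time in Python (a minimal sketch with a stand-in workload; for the actual GPU model the `torch.cuda.synchronize()` calls noted in the comments are the important part, since CUDA kernel launches are asynchronous):

```python
import time

def time_inference(fn, warmup=10, iters=100):
    # Warm-up runs so one-time costs (lazy CUDA init, cuDNN autotuning)
    # do not pollute the measurement.
    for _ in range(warmup):
        fn()
    # For a real GPU model, call torch.cuda.synchronize() here before
    # starting the timer -- otherwise you only time the kernel launches.
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    # ...and synchronize again before stopping the timer.
    elapsed = (time.perf_counter() - start) / iters
    return elapsed * 1000.0  # average per-iteration time in ms

# Stand-in CPU workload purely for illustration.
print(round(time_inference(lambda: sum(range(10000)), warmup=2, iters=20), 3))
```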
TensorRT Version: 7.1.3-1
Platform: Nvidia Jetson Xavier NX
JetPack Version: 4.5
L4T Version: 32.5
CUDA Version: 10.2.89
CUDNN Version: 8.0.0
Operating System + Version: Ubuntu 18.04.5 LTS
Link to the log files.
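In case it helps reproduce the per-layer timings in the logs, a profiling run along these lines (TensorRT 7.x `trtexec` flags; the engine filename is a placeholder) should produce comparable output:

```
# Load the serialized engine and dump per-layer timing.
/usr/src/tensorrt/bin/trtexec --loadEngine=model.engine --dumpProfile --iterations=100
```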