TensorRT inference slower than PyTorch, different tactics are being selected

Description

Hello everyone,

I have a straightforward model with a single Conv2d layer that takes an input of size [1, 9, 1232, 1832] and produces an output of size [1, 1, 1201, 1801]. The model performs well in PyTorch, but after converting it to TensorRT through ONNX it runs roughly 50% slower. For benchmarking purposes, both the convolution kernel and the input are generated randomly.
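
For reference, here is a minimal repro sketch of the setup. The kernel size of 32 with stride 1, no padding, and no bias is an assumption I inferred from the input/output shapes; the real layer may differ.

```python
import torch
import torch.nn as nn

# Assumed layer configuration: 9 -> 1 channels, 32x32 kernel, stride 1, no padding.
# With a [1, 9, 1232, 1832] input this yields the [1, 1, 1201, 1801] output mentioned above.
model = nn.Conv2d(in_channels=9, out_channels=1, kernel_size=32, bias=False).cuda().eval()
x = torch.randn(1, 9, 1232, 1832, device="cuda")

with torch.no_grad():
    y = model(x)
print(y.shape)  # torch.Size([1, 1, 1201, 1801])

# Export to ONNX for trtexec.
torch.onnx.export(
    model, x, "conv.onnx",
    input_names=["input"], output_names=["output"],
    opset_version=17,
)
```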

When I profile the PyTorch model with Nsight Systems, I see that it invokes a particular tactic (kernel) named cudnn::cnn::conv2d_grouped_direct_kernel, at about 8 ms per inference. The TensorRT engine, which I build with trtexec.exe, selects a different tactic instead: sm80_xmma_fprop_implicit_gemm_indexed_wo_smem_f32f32_tf32f32_f32_nhwckrsc_nhwc_tilesize128x16x64_stage1_warpsize4x1x1_g1_tensor16x8x8_alignc4. This tactic is noticeably slower, at about 12 ms per inference. Adding to my confusion, the trtexec log contains the message “CudnnConvolution has no valid tactics for this config, skipping”. I am wondering why trtexec bypasses the cuDNN tactics on my setup when PyTorch ends up using the faster cuDNN kernel.
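
In case it helps, this is the kind of experiment I was planning in order to check the tactic-source behaviour through the TensorRT Python API instead of trtexec. It is only a sketch; the enum and method names are from the TensorRT 8.6 Python bindings as I understand them, and "conv.onnx" refers to the export from the snippet above.

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.VERBOSE)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("conv.onnx", "rb") as f:
    assert parser.parse(f.read()), parser.get_error(0)

config = builder.create_builder_config()
print(f"default tactic sources bitmask: {config.get_tactic_sources():#x}")

# Explicitly enable the cuDNN (and cuBLAS) tactic sources on top of the defaults,
# to see whether a cuDNN convolution tactic then gets selected.
config.set_tactic_sources(
    config.get_tactic_sources()
    | (1 << int(trt.TacticSource.CUDNN))
    | (1 << int(trt.TacticSource.CUBLAS))
)

engine_bytes = builder.build_serialized_network(network, config)
with open("conv.engine", "wb") as f:
    f.write(engine_bytes)
```

I believe trtexec exposes the same switch through its --tacticSources option, but I have not yet confirmed whether enabling cuDNN there changes the selected kernel.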

Could anyone suggest what the reason for this discrepancy might be?

Thanks a lot!

Environment

TensorRT Version: 8.6.1.6
GPU Type: NVIDIA GeForce RTX 4090
Nvidia Driver Version: 545.92
CUDA Version: 12.1
CUDNN Version: 8.9.5
Operating System + Version: Windows 11 version 10.0.22621.2506
Python Version (if applicable): 3.10

Hi,

Could you please share the model, script, profiler output, and performance numbers (if not already shared) so that we can help you better?

Alternatively, you can try running your model with the trtexec command.

While measuring the model performance, make sure you consider the latency and throughput of the network inference only, excluding the data pre- and post-processing overhead (see the timing sketch below).
Please refer to the links below for more details:
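
For example, CUDA events can be used to time only the network forward pass. This is a rough sketch that assumes `model` and `x` are the Conv2d module and a random input tensor already resident on the GPU (e.g. from the repro snippet above):

```python
import torch

starter = torch.cuda.Event(enable_timing=True)
ender = torch.cuda.Event(enable_timing=True)

with torch.no_grad():
    for _ in range(10):          # warm-up iterations
        model(x)
    torch.cuda.synchronize()

    times_ms = []
    for _ in range(100):
        starter.record()
        model(x)                 # network inference only; no pre/post-processing or copies
        ender.record()
        torch.cuda.synchronize()
        times_ms.append(starter.elapsed_time(ender))

print(f"median latency: {sorted(times_ms)[len(times_ms) // 2]:.2f} ms")
```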

Thanks!