TensorRT stuck on tuning plugin in FP16 mode

Description

I have an ONNX model that includes the GridSampler op. Since this is a custom op, I have built the corresponding plugin (from TensorRT/plugin/gridSamplerPlugin in the TrojanXu/onnxparser-trt-plugin-sample repository on GitHub) as a shared library and load it when building the engine.
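For reference, the engine build looks roughly like this (a minimal sketch; the plugin library and model paths are placeholders for my actual setup):

import ctypes
import tensorrt as trt

PLUGIN_LIB = "libgridsampler_plugin.so"  # placeholder: path to the built plugin shared lib
ONNX_MODEL = "model.onnx"                # placeholder: path to the ONNX model

logger = trt.Logger(trt.Logger.VERBOSE)

# Load the plugin shared lib so its GridSampler creator registers itself,
# then initialize the plugin registry.
ctypes.CDLL(PLUGIN_LIB)
trt.init_libnvinfer_plugins(logger, "")

builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open(ONNX_MODEL, "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise SystemExit("ONNX parse failed")

config = builder.create_builder_config()
config.max_workspace_size = 4 << 30      # 4 GiB workspace
config.set_flag(trt.BuilderFlag.FP16)    # without this flag the build succeeds

serialized_engine = builder.build_serialized_network(network, config)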

The model converts fine in FP32 mode, but with FP16 enabled (as in the sketch above) the builder gets stuck at this stage:

[10/20/2022-11:02:28] [TRT] [V] =============== Computing costs for 
[10/20/2022-11:02:28] [TRT] [V] *************** Autotuning format combination: Float(10240000,40000,200,1), Float(80000,400,2,1) -> Float(10240000,40000,200,1) ***************
[10/20/2022-11:02:28] [TRT] [V] --------------- Timing Runner: grid_sampler_1021 (PluginV2)
[10/20/2022-11:02:28] [TRT] [V] Tactic: 0x0000000000000000 Time: 0.181931
[10/20/2022-11:02:28] [TRT] [V] Fastest Tactic: 0x0000000000000000 Time: 0.181931
[10/20/2022-11:02:28] [TRT] [V] >>>>>>>>>>>>>>> Chose Runner Type: PluginV2 Tactic: 0x0000000000000000
[10/20/2022-11:02:28] [TRT] [V] *************** Autotuning format combination: Half(10240000,40000,200,1), Half(80000,400,2,1) -> Half(10240000,40000,200,1) ***************
[10/20/2022-11:02:28] [TRT] [V] --------------- Timing Runner: grid_sampler_1021 (PluginV2)

It stays stuck there indefinitely, with 100% GPU utilization:

$ nvidia-smi
Thu Oct 20 11:07:32 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A6000    On   | 00000000:03:00.0 Off |                  Off |
| 30%   45C    P2   105W / 300W |  10351MiB / 49140MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    737234      C   python                          10348MiB |
+-----------------------------------------------------------------------------+

What could be the reason for this?

I am running the job in a Docker container derived from nvcr.io/nvidia/pytorch:22.07-py3.

Environment

TensorRT Version: 8.4.1.5
GPU Type: NVIDIA RTX A6000
Nvidia Driver Version: 520.61.05
CUDA Version: 11.7 Update 1 Preview
CUDNN Version:
Operating System + Version: Ubuntu 20.04
Python Version (if applicable): 3.8.13
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag): Container, nvcr.io/nvidia/pytorch:22.07-py3

Could you please share the complete verbose logs and, if possible, a repro ONNX model along with the command/steps to try from our end for better debugging?

Thank you.