This line of code runs normally with TensorRT 7.2.3.4 + CUDA 11.1 and takes about 2 ms, but it takes 300 ms with TensorRT 8.0.3.4 + CUDA 11.2. The engines in both environments were converted from ONNX, and the conversion completed without errors.
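Because CUDA calls are asynchronous, a single line can appear slow when it is really waiting on GPU work queued earlier. The following is a minimal sketch of how the timing of one call could be isolated; it is not the code from this report. The helper `time_call` and its `sync` parameter are hypothetical names introduced for illustration. With PyTorch, `torch.cuda.synchronize` could be passed as `sync`.

```python
import statistics
import time


def time_call(fn, sync=None, warmup=3, iters=10):
    """Return the median wall-clock time of fn() in milliseconds.

    sync is an optional callable that blocks until the GPU is idle
    (an assumption of this sketch, e.g. torch.cuda.synchronize).
    """
    def run_once():
        if sync is not None:
            sync()  # drain work queued before the measured call
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # wait for work launched by the measured call
        return (time.perf_counter() - start) * 1000.0

    # Warm-up runs are discarded so one-time initialization is not counted.
    for _ in range(warmup):
        run_once()
    return statistics.median(run_once() for _ in range(iters))


if __name__ == "__main__":
    # Stand-in workload; replace the lambda with the line being measured.
    print(f"{time_call(lambda: time.sleep(0.002)):.3f} ms")
```

Running the same helper in both environments would show whether the 300 ms belongs to the line itself or to earlier work that it waits on.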
Environment
TensorRT Version: 7.2.3.4 + CUDA 11.1; 8.0.3.4 + CUDA 11.2
GPU Type: GTX 2080 TI
Nvidia Driver Version: 470.141.03
CUDA Version: 11
CUDNN Version: 8.1.0 in both environments
Operating System + Version: Ubuntu 18.04
Python Version (if applicable):
TensorFlow Version (if applicable):
PyTorch Version (if applicable): 1.9
Baremetal or Container (if container which image + tag):