TensorRT produces unexpected NaN values during inference

Description

I’m working with an existing codebase, KataGo, which offers two backends that users can select between for neural net inference. One backend uses CUDA, and the other uses TensorRT.

On most models and inputs, the CUDA and TensorRT backends produce the same result, as expected. However, on some model-input combinations, the TensorRT backend occasionally produces NaN values partway through inference, and the resulting output is all NaNs or garbage.
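For concreteness, the comparison between the two backends is roughly elementwise: run the same input through both and check that the outputs are finite and agree within tolerance. The sketch below is illustrative only (it is not KataGo's actual test code, and the names are made up); it just shows the kind of check that the TensorRT output fails when the NaNs appear.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative only, not KataGo's test code: check that the TensorRT backend's
// output is finite and matches the CUDA backend's output for the same input.
bool outputsAgree(const std::vector<float>& cudaOut,
                  const std::vector<float>& trtOut,
                  float tol = 1e-3f) {
  if (cudaOut.size() != trtOut.size())
    return false;
  for (std::size_t i = 0; i < cudaOut.size(); i++) {
    if (!std::isfinite(trtOut[i]))                // the failing case: NaNs/garbage
      return false;
    if (std::fabs(cudaOut[i] - trtOut[i]) > tol)  // ordinary mismatch case
      return false;
  }
  return true;
}
```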

Environment

TensorRT Version: 8.5.0
GPU Type: A6000 (also reproduced on A4000)
Nvidia Driver Version: 510.60.02
CUDA Version: 11.8
CUDNN Version: 8.6.0
Operating System + Version: Ubuntu 20.04
Baremetal or Container (if container which image + tag): nvcr.io/nvidia/tensorrt:22.09-py3

Also reproduced on container nvcr.io/nvidia/tensorrt:22.08-py3 (TensorRT 8.4.2, CUDA 11.7, CUDNN 8.5.0)
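In case it helps, here is a quick way to confirm which TensorRT and CUDA runtime versions are actually linked inside the container at runtime. This is just a sketch, not part of KataGo; it assumes the standard NvInfer.h and CUDA runtime headers are on the include path.

```cpp
#include <NvInfer.h>            // getInferLibVersion()
#include <cuda_runtime_api.h>   // cudaRuntimeGetVersion()
#include <cstdio>

int main() {
  // getInferLibVersion() returns MAJOR*1000 + MINOR*100 + PATCH, e.g. 8500 for 8.5.0.
  int trt = getInferLibVersion();
  int cudaRt = 0;
  cudaRuntimeGetVersion(&cudaRt);  // e.g. 11080 for CUDA 11.8
  std::printf("TensorRT %d.%d.%d, CUDA runtime %d.%d\n",
              trt / 1000, (trt / 100) % 10, trt % 100,
              cudaRt / 1000, (cudaRt % 1000) / 10);
  return 0;
}
```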

Steps To Reproduce

A reproduction and description are available here: KataGo GitHub issue

This is certainly not a minimal reproduction, since it requires cloning the KataGo codebase, and I’m not going to spend more of my time reducing it further or fully tracking down the source of the issue. As such, I probably can’t expect anyone to investigate this deeply for me, but I’d be curious to know whether it resembles any known TensorRT or driver issues. (It’s possible this is a bug in KataGo’s TensorRT backend code, but the developer who wrote that backend suspects it’s an NVIDIA issue, so I figured I would ask here.)

Hi,

Sorry, but it would be great if you could provide us with a minimal repro of the issue for better debugging.
Please refer to the following similar issue, which may help you.

Thank you.