Inference of an FP16 engine in C++ gives NaN output, but inference of the FP32 engine gives correct results

Description

Hi, TRT team. When I run inference with an FP16 TensorRT engine in C++, I get NaN outputs, but with the same C++ code and the FP32 engine swapped in for the FP16 one, the output is correct. Both the FP16 and FP32 models give correct outputs in Python. So, is there anything I need to change when running an FP16 model in C++?

I generate the FP16 engine with trtexec.exe:

trtexec.exe --onnx=best_fp16.onnx --saveEngine=best_fp16.engine --fp16

I have uploaded my FP32 ONNX model (best.onnx), FP32 engine (best.engine), FP16 ONNX model (best_fp16.onnx), and FP16 engine (best_fp16.engine).

Environment

TensorRT Version: 8.2.1.8
GPU Type: RTX3050
Nvidia Driver Version: 528.49
CUDA Version: 11.4
CUDNN Version: 8.2.2.26
Operating System + Version: Windows10
Python Version (if applicable):
TensorFlow Version (if applicable):
PyTorch Version (if applicable): 1.11.0
Baremetal or Container (if container which image + tag):

Relevant Files

Please attach or include links to any models, data, files, or scripts necessary to reproduce your issue. (Github repo, Google Drive, Dropbox, etc.)
best.engine (13.2 MB)
best.onnx (7.9 MB)
best_fp16.engine (7.9 MB)
best_fp16.onnx (4.0 MB)

Steps To Reproduce

Please include:

  • Exact steps/commands to build your repro
  • Exact steps/commands to run your repro
  • Full traceback of errors encountered

Hi,
Could you try running your model with the trtexec command and share the --verbose log if the issue persists?
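For example, rebuilding with verbose logging enabled (based on the command in the original post; redirecting the output to a log file is optional):

trtexec.exe --onnx=best_fp16.onnx --saveEngine=best_fp16.engine --fp16 --verbose > trt.log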

You can refer to the link below for the list of supported operators; if any operator is not supported, you will need to write a custom plugin to support that operation.

Also, please share your model and script if you have not already, so that we can help you better.

Meanwhile, for some common errors and queries, please refer to the link below:

Thanks!

Hi, thanks for your reply.
Here is the verbose log (trt.log); I will upload the script later.
trt.log (7.4 MB)

And here is the script:
main_detect.cpp (6.6 KB)
cuda_utils.h (417 Bytes)
post_process.cpp (4.2 KB)
post_process.h (643 Bytes)
preprocess.cu (5.8 KB)
preprocess.h (522 Bytes)

Hi,

We observed an accuracy difference in FP16 precision when we tried to reproduce the issue.
Please allow us some time to work on this issue.

Thank you.

Hi, spolisetty, AakankshaS
How is it going?

I updated TensorRT to 8.6.1.6, and the problem remains.

I also tested another model, yolov5s: it gives correct results in Python in both FP32 and FP16, but in C++ it returns NaN in both FP32 and FP16. So I'm guessing the problem may be in my preprocessing code?
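One way to test that hypothesis (a debugging sketch only; the device pointer, element count, and expected value range are assumptions, not taken from the attached code) is to copy the preprocessed input buffer back to the host and check it for NaN/Inf before enqueueing the engine:

#include <cuda_runtime.h>
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Returns false if the preprocessed input already contains NaN/Inf,
// i.e. the problem is upstream of TensorRT.
bool inputLooksValid(const float* dInput, size_t numElems)
{
    if (numElems == 0) return false;
    std::vector<float> host(numElems);
    cudaMemcpy(host.data(), dInput, numElems * sizeof(float), cudaMemcpyDeviceToHost);

    float minV = host[0], maxV = host[0];
    for (float v : host) {
        if (std::isnan(v) || std::isinf(v)) {
            std::printf("preprocess produced NaN/Inf\n");
            return false;
        }
        minV = std::min(minV, v);
        maxV = std::max(maxV, v);
    }
    // yolov5-style preprocessing normally normalizes pixels to [0, 1].
    std::printf("input range: [%f, %f]\n", minV, maxV);
    return true;
}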

I encountered this problem too. I have a TF model: with 8.5.1.7 + FP16 all scores are fine, but after updating TRT to 8.6.1 + FP16, all scores are NaN.

Because I need multiple optimization profiles + refit, I had to update to 8.6.

My problem is solved: an FP32 ONNX input has this problem with 8.6.1 + FP16 but is fine with 8.5.1. So I convert the ONNX to FP16 before feeding it to TRT, and it runs correctly.

My ONNX was FP16, and after converting it to an FP16 engine the output was NaN in C++, but the output was correct in Python. I tried TensorRT 8.6.1 and 8.5.1, and things didn't get better. My ONNX came from yolov5/export.py.

The problem has been resolved. Just use the FP32 ONNX model to generate the TensorRT engine, whether building an FP32 engine or an FP16 engine.
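In other words, both engines are built from the same FP32 ONNX and only the --fp16 flag differs (mirroring the command in the original post):

trtexec.exe --onnx=best.onnx --saveEngine=best.engine
trtexec.exe --onnx=best.onnx --saveEngine=best_fp16.engine --fp16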

When using FP16 to represent absolute values of ~1000, the accuracy loss is quite expected.
Our general advice is to either limit the maximum absolute value of the activations or explicitly set the precision of the affected layers to FP32.
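A minimal sketch of the second option with the TensorRT 8.x C++ builder API (the "suspect" layer-name filter is a placeholder; pick the layers that actually overflow, e.g. from the verbose log):

#include <NvInfer.h>
#include <string>

// Build with FP16 enabled, but pin layers that overflow in FP16 back to FP32.
void configureMixedPrecision(nvinfer1::INetworkDefinition* network,
                             nvinfer1::IBuilderConfig* config)
{
    config->setFlag(nvinfer1::BuilderFlag::kFP16);
    // Make the builder honor the per-layer precisions set below.
    config->setFlag(nvinfer1::BuilderFlag::kOBEY_PRECISION_CONSTRAINTS);

    for (int i = 0; i < network->getNbLayers(); ++i) {
        nvinfer1::ILayer* layer = network->getLayer(i);
        if (std::string(layer->getName()).find("suspect") != std::string::npos) {
            layer->setPrecision(nvinfer1::DataType::kFLOAT);      // run this layer in FP32
            layer->setOutputType(0, nvinfer1::DataType::kFLOAT);  // keep its output in FP32
        }
    }
}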

Thank you.