TensorRT EfficientNMS plugin FP16 inconsistent (but valid) results

Description

Hello!

I encountered an issue with the EfficientNMS plugin in FP16 mode, and I’m wondering whether it should be treated as a bug or whether it is expected and can be explained.

Running inference multiple times on the same image in FP16 mode very often produces slightly different output, with boxes one or a few pixels off. Example:
1st run [104 x 75 from (142, 156)] vs 2nd run [105 x 73 from (141, 158)], with the same confidence score.
Also, the order of bounding boxes with the same score sometimes differs from the previous iteration.

I consider the detections valid, as they are very close. However, this is a problem for my testing plan: I assumed that re-using the same TensorRT plan file would always give deterministic results, which has been the case until now.
Does anyone have an idea how to explain this behavior?
If it’s a bug, I’ll try to prepare a reproduction using publicly available models/data.
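
For the testing plan mentioned above, one option in the meantime might be a tolerance-based comparison instead of an exact match. This is only a minimal sketch under assumptions: detections are taken to be (x, y, w, h, score) tuples, the function names and the 2-pixel default tolerance are hypothetical, and none of it comes from TensorRT itself.

```python
def _close(a, b, pixel_tol):
    """True if two (x, y, w, h, score) detections have equal scores and
    every box coordinate differs by at most pixel_tol."""
    return a[4] == b[4] and all(
        abs(p - q) <= pixel_tol for p, q in zip(a[:4], b[:4])
    )


def detections_match(run_a, run_b, pixel_tol=2):
    """Compare two runs while ignoring the order of the detections.

    Each detection in run_a is greedily paired with the first unmatched
    detection in run_b that is close enough. Greedy pairing is a
    simplification and may fail for heavily overlapping boxes.
    """
    if len(run_a) != len(run_b):
        return False
    unmatched = list(run_b)
    for det in run_a:
        for i, cand in enumerate(unmatched):
            if _close(det, cand, pixel_tol):
                del unmatched[i]
                break
        else:
            # No partner found for this detection.
            return False
    return True


# Boxes from the example above; the score 0.9 is made up for illustration.
first_run = [(142, 156, 104, 75, 0.9)]
second_run = [(141, 158, 105, 73, 0.9)]
print(detections_match(first_run, second_run))
```

With the two boxes from the example, the largest coordinate difference is 2 pixels, so they count as a match at the default tolerance but not with pixel_tol=1.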

Isolation:

  • the issue occurs when re-using the same TRT plan
  • with FP32, the issue does NOT occur
  • with FP16 and the plugins around EfficientNMS forced to FP32, the issue does NOT occur
  • confirmed that the NMS plugin receives the same data in each iteration and sometimes outputs slightly different results (dumped and compared raw bytes)
  • not all images are “problematic”; mostly those with many detections
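
The raw-byte check from the list above can be sketched roughly as follows. This assumes the plugin's input and output buffers have already been dumped as `bytes` objects, one per iteration; the function names are hypothetical, and how the buffers are dumped is left out on purpose.

```python
import hashlib


def digest(buf: bytes) -> str:
    """SHA-256 hex digest of one raw buffer dump."""
    return hashlib.sha256(buf).hexdigest()


def summarize_runs(inputs, outputs):
    """Count distinct input and output dumps across iterations.

    inputs / outputs: lists of raw byte dumps, one entry per iteration.
    One distinct input together with more than one distinct output means
    the plugin produced different bytes for identical data.
    """
    return {
        "distinct_inputs": len({digest(b) for b in inputs}),
        "distinct_outputs": len({digest(b) for b in outputs}),
    }


# Made-up byte strings standing in for real dumps.
summary = summarize_runs(
    inputs=[b"same", b"same", b"same"],
    outputs=[b"out-a", b"out-b", b"out-a"],
)
print(summary)
```

Comparing digests rather than the full buffers keeps the per-iteration log small, at the cost of not showing where the bytes differ.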

Thank you!

Environment

TensorRT Version: 8.4 - 8.5.1.7 (previous versions don’t work because of another issue)
GPU Type: RTX2070
Nvidia Driver Version: 525.60.13
CUDA Version: 11.8
CUDNN Version: 8.6.0
Operating System + Version: Ubuntu 18.04
Python Version (if applicable):
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag): baremetal

Relevant Files

Steps To Reproduce

Hi,

We are checking on this issue internally.
Could you please share a minimal repro model/script with us for better debugging?

Thank you.

Thank you for the reply.
I will try to create a minimal setup that reproduces this issue based on publicly available data. Please allow me some time, as I cannot share our project’s code/models here.
