Different FP16 inference behavior on four GPUs

Description

I have an ONNX model exported from torchvision's Faster R-CNN. When deploying it with TensorRT 8.2's trtexec, I obtain correct results under FP32 on all four GPUs I tested (JetsonTX2-NX, RTX2080, RTX3060, RTX3080Ti). However, when I switch to FP16 mode (i.e., --fp16), the results on the JetsonTX2-NX and RTX2080 degrade significantly, and to the same degree, relative to FP32, while the RTX3060 and RTX3080Ti retain their accuracy.

I built and installed trtexec on each device following the same procedure. The CUDA versions differ slightly across the devices (CUDA 10.x on the JetsonTX2-NX and CUDA 11.x on the other three). Since the FP32 results are correct on all of them, I do not believe the small differences in CUDA, cuDNN, and OS versions caused the degradation under FP16.

Instead, I am more inclined to believe that the issue is related to the data types / precisions supported by each device. Do you have an explanation for this behavior, and perhaps some suggestions on how to address it?
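
For reference, here is a minimal sketch of how the reported FP16 support can be queried on each device through the TensorRT Python API; it is only an illustrative check, using the platform-capability properties of the 8.x Python bindings:

import tensorrt as trt

# Query what this device / TensorRT build reports for reduced-precision support.
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
print("Fast FP16 supported:", builder.platform_has_fast_fp16)
print("Fast INT8 supported:", builder.platform_has_fast_int8)

This only tells me whether a device advertises fast FP16 paths; it says nothing about whether this particular network stays numerically stable in FP16.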

Many thanks in advance!

Environment

TensorRT Version: 8.2.1.8
GPU Type: JetsonTX2-NX, RTX2080, RTX3060, RTX3080Ti
Nvidia Driver Version:
CUDA Version: 10 and 11
CUDNN Version: 8.6
Operating System + Version: Ubuntu 18.04 / 20.04

Relevant Files

More than happy to provide my model via private message. I am also happy to provide detailed logs if needed (the four devices produce different trtexec logs, which would be lengthy and potentially confusing to post here).

Hi,

Could you please let us know how big the divergence is between the devices?
Please share with us verbose logs and an issue repro.
Those devices span three different GPU generations, each with different hardware acceleration for FP16. That alone can cause some divergence; even on a single device, building the same network twice can result in slightly different numerical accuracy.
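
If particular layers turn out to be FP16-sensitive (for example, intermediate values overflowing the FP16 range), a common mitigation is to keep those layers in FP32 while building the rest of the network in FP16. Below is only a rough sketch with the Python API; the layer-selection criterion is a placeholder, and the builder-flag name may differ slightly between 8.x releases:

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

# Note: for this model the custom RoiAlign plugin library must already be
# loaded/registered, otherwise parsing will fail.
with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError("Failed to parse the ONNX model")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)
# Ask the builder to honour per-layer precision requests.
# (Named OBEY_PRECISION_CONSTRAINTS on 8.2; older releases used STRICT_TYPES.)
config.set_flag(trt.BuilderFlag.OBEY_PRECISION_CONSTRAINTS)

# Keep suspected FP16-sensitive layers in FP32 -- the selection below is only a
# placeholder; in practice you would pin the specific layers you suspect.
for i in range(network.num_layers):
    layer = network.get_layer(i)
    if "suspect" in layer.name:  # hypothetical selection criterion
        layer.precision = trt.float32
        layer.set_output_type(0, trt.float32)

serialized_engine = builder.build_serialized_network(network, config)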

Thank you.

Thank you for your response!

It is perhaps more straightforward if I just focus on the problematic device here (i.e., the Jetson TX2-NX). The divergence between its FP32 and FP16 results is very large (the model is a detector, and the bounding boxes produced under FP16 are very poor).

Via DM, I have shared my ONNX model, the TensorRT OSS source files (for building a new libnvinfer_plugin.so, the ONNX parser, and even trtexec if needed), and a sample image (.dat and .jpg). The OSS sources are needed because I had to include a custom plugin: I am using TensorRT 8.2.1.8, which does not natively support the RoiAlign op used in my model, so the RoiAlign plugin was taken from a later version of TensorRT-OSS.
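
In case it is easier to reproduce through the Python API rather than trtexec, this is roughly how the rebuilt plugin library can be registered before parsing the model (the library path below is only a placeholder):

import ctypes
import tensorrt as trt

# Load the rebuilt plugin library so its RoiAlign plugin creator is added to
# TensorRT's plugin registry (the path is a placeholder).
ctypes.CDLL("/path/to/rebuilt/libnvinfer_plugin.so", mode=ctypes.RTLD_GLOBAL)

logger = trt.Logger(trt.Logger.WARNING)
trt.init_libnvinfer_plugins(logger, "")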

Steps to reproduce:

  1. Build new shared libraries (.so) from the provided TensorRT-OSS files (following standard build steps).

  2. Update LD_LIBRARY_PATH so that the newly built .so files are picked up ahead of the stock libraries.

  3. Build TensorRT engines from the provided ONNX model and run inference with the following commands (FP32 and FP16, respectively):

trtexec --onnx=model.onnx --loadInputs="input0:test_image.dat" --dumpOutput

trtexec --onnx=model.onnx --loadInputs="input0:test_image.dat" --dumpOutput --fp16

The target device is the Jetson TX2-NX. The outputs of the two commands above mismatch drastically (the first one corresponds to FP32 and is correct). I have also provided the verbose logs of both commands.
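
To quantify the mismatch beyond eyeballing the dumped values, the two commands can also be re-run with trtexec's --exportOutput option and the exported tensors compared numerically. A rough sketch follows; it assumes the exported JSON is a list of per-tensor records with "name" and "values" fields, which should be double-checked against the format your trtexec version writes, and the file names are hypothetical:

import json
import numpy as np

def load_outputs(path):
    # Assumes the file written by "trtexec ... --exportOutput=<path>" is a list
    # of per-tensor records with "name" and "values" fields; please verify the
    # exact schema produced by your trtexec version.
    with open(path) as f:
        records = json.load(f)
    return {r["name"]: np.asarray(r["values"], dtype=np.float64).ravel()
            for r in records}

fp32 = load_outputs("outputs_fp32.json")  # hypothetical file names
fp16 = load_outputs("outputs_fp16.json")

for name in sorted(fp32):
    a, b = fp32[name], fp16[name]
    abs_diff = np.abs(a - b)
    rel_diff = abs_diff / (np.abs(a) + 1e-6)
    print(f"{name}: max abs {abs_diff.max():.4g}, "
          f"max rel {rel_diff.max():.4g}, mean abs {abs_diff.mean():.4g}")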

Please help look into this matter, and don't hesitate to let me know if you need more details. Thank you in advance!

Hi,

Sorry for the delayed response.
We were able to reproduce the same behavior.
Please allow us some time to work on this issue.

Thank you.