TensorRT output mismatch with ONNX output

Description

The TensorRT output does not match the ONNX Runtime output. Running polygraphy run fails with “Difference exceeds tolerance”.

ONNX file: https://pan.baidu.com/s/1qd3NSrqIU-aJ4ZrHxO97Ag?pwd=43rg (extraction code: 43rg)

Polygraphy command:

polygraphy run /tmp/Janus-Pro-7B/vision_encoder_bfp16.onnx --onnxrt --trt \
    --save-engine=/tmp/Janus-Pro-7B/vision_encoder_bfp16.trt \
    --trt-min-shapes 'input:[1,3,384,384]' \
    --trt-opt-shapes 'input:[1,3,384,384]' \
    --trt-max-shapes 'input:[8,3,384,384]' \
    --input-shapes 'input:[-1,3,384,384]' \
    --atol 1e-1 --rtol 1e-1 \
    --fail-fast

Failure log:

[I] trt-runner-N0-02/20/25-22:26:11     | Completed 1 iteration(s) in 1723 ms | Average inference time: 1723 ms.
[I] Accuracy Comparison | onnxrt-runner-N0-02/20/25-22:26:11 vs. trt-runner-N0-02/20/25-22:26:11
[I]     Comparing Output: 'output' (dtype=float16, shape=(1, 576, 4096)) with 'output' (dtype=float16, shape=(1, 576, 4096))
[I]         Tolerance: [abs=0.1, rel=0.1] | Checking elemwise error
[I]         onnxrt-runner-N0-02/20/25-22:26:11: output | Stats: mean=-0.035358, std-dev=4.1854, var=17.517, median=-0.0059319, min=-303.75 at (0, 121, 2526), max=102.44 at (0, 121, 411), avg-magnitude=2.5026
[I]             ---- Histogram ----
                Bin Range      |  Num Elems | Visualization
                (-304 , -263 ) |          2 |
                (-263 , -222 ) |          1 |
                (-222 , -182 ) |          1 |
                (-182 , -141 ) |          1 |
                (-141 , -101 ) |         13 |
                (-101 , -60  ) |        107 |
                (-60  , -19.4) |       5218 |
                (-19.4, 21.2 ) |    2350202 | ########################################
                (21.2 , 61.8 ) |       3692 |
                (61.8 , 102  ) |         59 |
[I]         trt-runner-N0-02/20/25-22:26:11: output | Stats: mean=-0.015617, std-dev=1.5098, var=2.2795, median=-0.0024776, min=-94.25 at (0, 121, 2526), max=43.531 at (0, 121, 3649), avg-magnitude=0.99653
[I]             ---- Histogram ----
                Bin Range      |  Num Elems | Visualization
                (-304 , -263 ) |          0 |
                (-263 , -222 ) |          0 |
                (-222 , -182 ) |          0 |
                (-182 , -141 ) |          0 |
                (-141 , -101 ) |          0 |
                (-101 , -60  ) |          1 |
                (-60  , -19.4) |         79 |
                (-19.4, 21.2 ) |    2359190 | ########################################
                (21.2 , 61.8 ) |         26 |
                (61.8 , 102  ) |          0 |
[I]         Error Metrics: output
[I]             Minimum Required Tolerance: elemwise error | [abs=251.31] OR [rel=3.9922e+06] (requirements may be lower if both abs/rel tolerances are set)
[I]             Absolute Difference | Stats: mean=2.4379, std-dev=3.0746, var=9.453, median=1.4648, min=0 at (0, 0, 3844), max=251.31 at (0, 77, 2526), avg-magnitude=2.4379
[I]                 ---- Histogram ----
                    Bin Range    |  Num Elems | Visualization
                    (0   , 25.1) |    2355495 | ########################################
                    (25.1, 50.3) |       3576 |
                    (50.3, 75.4) |        194 |
                    (75.4, 101 ) |         21 |
                    (101 , 126 ) |          4 |
                    (126 , 151 ) |          1 |
                    (151 , 176 ) |          1 |
                    (176 , 201 ) |          1 |
                    (201 , 226 ) |          2 |
                    (226 , 251 ) |          1 |
[I]             Relative Difference | Stats: mean=28.596, std-dev=3638.5, var=1.3238e+07, median=1.9318, min=0 at (0, 0, 3844), max=3.9922e+06 at (0, 66, 2211), avg-magnitude=28.596
[I]                 ---- Histogram ----
                    Bin Range            |  Num Elems | Visualization
                    (0       , 3.99e+05) |    2359282 | ########################################
                    (3.99e+05, 7.98e+05) |          7 |
                    (7.98e+05, 1.2e+06 ) |          4 |
                    (1.2e+06 , 1.6e+06 ) |          1 |
                    (1.6e+06 , 2e+06   ) |          0 |
                    (2e+06   , 2.4e+06 ) |          1 |
                    (2.4e+06 , 2.79e+06) |          0 |
                    (2.79e+06, 3.19e+06) |          0 |
                    (3.19e+06, 3.59e+06) |          0 |
                    (3.59e+06, 3.99e+06) |          1 |
[E]         FAILED | Output: 'output' | Difference exceeds tolerance (rel=0.1, abs=0.1)
[E] FAILED | Runtime: 24.394s | Command: /usr/local/bin/polygraphy run /tmp/Janus-Pro-7B/vision_encoder_bfp16.onnx --onnxrt --trt --save-engine=/tmp/Janus-Pro-7B/vision_encoder_bfp16.trt --trt-min-shapes input:[1,3,384,384] --trt-opt-shapes input:[1,3,384,384] --trt-max-shapes input:[8,3,384,384] --input-shapes input:[-1,3,384,384] --atol 1e-1 --rtol 1e-1 --fail-fast
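
To make the "Error Metrics" lines above easier to read, here is a minimal Python sketch of an elementwise absolute/relative comparison. It follows the numpy.isclose convention (|ref - out| <= atol + rtol * |ref|) and is not Polygraphy's actual implementation. The function names are hypothetical, and treating the ONNX Runtime value as the reference is an assumption. The first pair of values is taken from the log (the min of each runner at index (0, 121, 2526)); the second pair is made up to show how a near-zero reference inflates the relative difference.

```python
import math


def elementwise_errors(reference, candidate):
    """Return (abs_diff, rel_diff) lists for two equal-length sequences."""
    abs_diff = [abs(r - c) for r, c in zip(reference, candidate)]
    # Relative difference is measured against the reference magnitude.
    # A zero reference gives inf, or 0.0 when both values agree exactly.
    rel_diff = [
        (d / abs(r)) if r != 0 else (0.0 if d == 0 else math.inf)
        for d, r in zip(abs_diff, reference)
    ]
    return abs_diff, rel_diff


def within_tolerance(reference, candidate, atol, rtol):
    """numpy.isclose-style check applied to every element."""
    return all(
        abs(r - c) <= atol + rtol * abs(r)
        for r, c in zip(reference, candidate)
    )


# Values reported in the log at index (0, 121, 2526): ONNX Runtime vs. TensorRT.
onnxrt_vals = [-303.75]
trt_vals = [-94.25]
abs_diff, rel_diff = elementwise_errors(onnxrt_vals, trt_vals)
print("abs diff:", abs_diff[0], "rel diff:", rel_diff[0])
print("passes atol=0.1, rtol=0.1:", within_tolerance(onnxrt_vals, trt_vals, 0.1, 0.1))

# Hypothetical pair: a tiny reference value makes the relative difference huge
# even though the absolute difference is modest.
_, tiny_rel = elementwise_errors([1e-6], [0.5])
print("rel diff with near-zero reference:", tiny_rel[0])
```

Under this convention a very large relative difference is not informative on its own, because it can come from reference values close to zero. The absolute difference and the shift in the output statistics between the two runners are the more telling signals in this log.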

Environment

TensorRT Version: 10.7.0

NVIDIA GPU: A10

NVIDIA Driver Version: 550.90.07

CUDA Version: 12.6

CUDNN Version: 9.6.0

Operating System: Ubuntu 24.04.1 LTS

Python Version (if applicable): 3.12.3

Hi @450959507,

Could you please check the information below and let me know if it helps?

  1. Output Mismatch Tolerances:
  • Issue: TensorRT outputs can differ numerically from ONNX outputs, which Polygraphy reports as “Difference exceeds tolerance”.
  • Solution: Adjust the absolute tolerance (atol) and relative tolerance (rtol) used for the comparison. Looser values, such as atol=1e-3 or 1e-1, are often enough to absorb expected numerical differences.
  2. Calibration File Issues:
  • Issue: Calibration files generated with one version of TensorRT may not work well with another version, which can affect performance and accuracy.
  • Solution: Regenerate calibration files with the TensorRT version currently in use. Reusing old files can cause regressions, so regenerate them after every version change.
  3. Engine Generation Errors:
  • Issue: Converting an ONNX model to a TensorRT engine can produce errors or suboptimal performance on some hardware.
  • Solution: Make sure the ONNX model uses supported data types, and consider exporting with a lower ONNX opset version if issues arise. Some features of newer opsets may not be fully supported.
  4. Dynamic Shapes and Configuration Issues:
  • Issue: Models with dynamic input shapes can produce incorrect outputs if the engine is not configured for them.
  • Solution: Define optimization profiles for the dynamic inputs, and make sure the minimum, optimal, and maximum shapes are set correctly when the engine is built.
  5. Using Polygraphy for Diagnostics:
  • Recommendation: Use Polygraphy for detailed comparisons and diagnostics. A command such as polygraphy run <onnx_model> --trt --onnxrt helps locate discrepancies, and raising the log verbosity can help with troubleshooting.

Thanks
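
As an illustration of points 4 and 5, here is a minimal Python sketch that assembles a Polygraphy command line with a single fixed input shape and every layer output marked for comparison. The helper name build_polygraphy_cmd and its defaults are hypothetical. The input name and dimensions are copied from the command in the report, and the sketch assumes the installed Polygraphy version accepts --verbose, --onnx-outputs mark all, and --trt-outputs mark all.

```python
import shlex


def build_polygraphy_cmd(onnx_path, batch=1, atol=0.1, rtol=0.1, mark_all=True):
    """Build an argv list for `polygraphy run` with a fixed batch size."""
    # Using the same concrete shape for min/opt/max and for the inference input
    # takes dynamic-shape handling out of the picture while debugging.
    shape = f"input:[{batch},3,384,384]"
    cmd = [
        "polygraphy", "run", onnx_path, "--onnxrt", "--trt",
        "--trt-min-shapes", shape,
        "--trt-opt-shapes", shape,
        "--trt-max-shapes", shape,
        "--input-shapes", shape,
        "--atol", str(atol), "--rtol", str(rtol),
        "--fail-fast", "--verbose",
    ]
    if mark_all:
        # Compare every intermediate tensor, not only the final output,
        # so the first diverging layer is reported.
        cmd += ["--onnx-outputs", "mark", "all", "--trt-outputs", "mark", "all"]
    return cmd


if __name__ == "__main__":
    argv = build_polygraphy_cmd("/tmp/Janus-Pro-7B/vision_encoder_bfp16.onnx")
    # Print a copy-pasteable shell command; pass argv to subprocess.run to execute it.
    print(shlex.join(argv))
```

Marking all outputs can change which layers TensorRT fuses, so the layer reported first is a starting point for investigation, not a definitive root cause.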