Semantic segmentation model output differs beyond 1e-3 tolerance when converted from ONNX to TensorRT

Description

I am converting a semantic segmentation model from ONNX to TensorRT format. The conversion succeeds, but the TensorRT outputs do not match the ONNX Runtime outputs within an atol and rtol of 1e-3. I am using Polygraphy to convert the model, with the following command:

polygraphy run model.onnx --trt --onnxrt --precision-constraints obey --save-engine model.engine --providers CUDAExecutionProvider --atol 1e-3 --rtol 1e-3

I would like help identifying any customizations that would bring the TensorRT model's outputs within an atol and rtol of 1e-3 of the ONNX Runtime outputs. Model accuracy is of the utmost importance in our case.
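To make the tolerance requirement concrete, here is a minimal sketch of an elementwise check. It assumes the comparison combines the two tolerances the way numpy.isclose does, i.e. |test - ref| <= atol + rtol * |ref|; the post does not confirm that this is exactly what Polygraphy does. The function name within_tolerance and the reference values are hypothetical; only the 0.0016909 figure comes from the log in Steps To Reproduce.

```python
def within_tolerance(test, ref, atol=1e-3, rtol=1e-3):
    """Return True if |test - ref| <= atol + rtol * |ref|.

    Assumption: this mirrors the numpy.isclose convention, not a
    confirmed copy of Polygraphy's internal check.
    """
    return abs(test - ref) <= atol + rtol * abs(ref)


# Largest absolute difference reported in the comparison log.
max_abs_diff = 0.0016909

# Hypothetical reference value near 1.0: the relative term adds about
# 1e-3, so the allowed difference is roughly 2e-3.
high_ref = 0.99998
print(within_tolerance(high_ref - max_abs_diff, high_ref))  # True

# Hypothetical reference value near 0: the relative term is negligible,
# so the allowed difference is roughly 1e-3.
low_ref = 0.002
print(within_tolerance(low_ref + max_abs_diff, low_ref))  # False
```

If the check works this way, no element whose reference value is above 0.9 can fail, because the largest observed difference (0.0016909) is below the roughly 1.9e-3 bound there. The failing elements would then all be among the near-zero outputs.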

Environment

TensorRT Version: 10.8.0.43
GPU Type: GeForce RTX 2060
Nvidia Driver Version: 572.60
CUDA Version: 12.8
CUDNN Version: v9.5
Operating System + Version: Windows 24H2 26100.3476
Python Version (if applicable): 3.12.9
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag): Baremetal

Relevant Files

I can't share the model files here, but I am happy to share them privately.

Steps To Reproduce

Please run the polygraphy command mentioned above.

(onnx-opt) C:\Users\msrir\dev\VisionKit\customers\aiv>polygraphy run model.onnx --trt --onnxrt --precision-constraints obey --save-engine model.engine --providers CUDAExecutionProvider --atol 1e-3 --rtol 1e-3
[I] RUNNING | Command: \\?\C:\Users\msrir\anaconda3\envs\onnx-opt\Scripts\polygraphy run model.onnx --trt --onnxrt --precision-constraints obey --save-engine model.engine --providers CUDAExecutionProvider --atol 1e-3 --rtol 1e-3
[I] TF32 is disabled by default. Turn on TF32 for better performance with minor accuracy differences.
[I] TF32 is disabled by default. Turn on TF32 for better performance with minor accuracy differences.
[I] trt-runner-N0-03/13/25-15:32:52     | Activating and starting inference
[I] Configuring with profiles:[
        Profile 0:
            {data [min=[1, 3, 512, 512], opt=[1, 3, 512, 512], max=[1, 3, 512, 512]]}
    ]
[W] profileSharing0806 is on by default in TensorRT 10.0. This flag is deprecated and has no effect.
[I] Building engine with configuration:
    Flags                  | [OBEY_PRECISION_CONSTRAINTS]
    Engine Capability      | EngineCapability.STANDARD
    Memory Pools           | [WORKSPACE: 6143.69 MiB, TACTIC_DRAM: 6143.69 MiB, TACTIC_SHARED_MEMORY: 1024.00 MiB]
    Tactic Sources         | [EDGE_MASK_CONVOLUTIONS, JIT_CONVOLUTIONS]
    Profiling Verbosity    | ProfilingVerbosity.DETAILED
    Preview Features       | [PROFILE_SHARING_0806]
[I] Finished engine building in 106.870 seconds
[I] trt-runner-N0-03/13/25-15:32:52
    ---- Inference Input(s) ----
    {data [dtype=float32, shape=(1, 3, 512, 512)]}
[I] trt-runner-N0-03/13/25-15:32:52
    ---- Inference Output(s) ----
    {output [dtype=float32, shape=(1, 3, 512, 512)]}
[I] trt-runner-N0-03/13/25-15:32:52     | Completed 1 iteration(s) in 2824 ms | Average inference time: 2824 ms.
[I] onnxrt-runner-N0-03/13/25-15:32:52  | Activating and starting inference
[I] Creating ONNX-Runtime Inference Session with providers: ['CUDAExecutionProvider']
[I] onnxrt-runner-N0-03/13/25-15:32:52
    ---- Inference Input(s) ----
    {data [dtype=float32, shape=(1, 3, 512, 512)]}
[I] onnxrt-runner-N0-03/13/25-15:32:52
    ---- Inference Output(s) ----
    {output [dtype=float32, shape=(1, 3, 512, 512)]}
[I] onnxrt-runner-N0-03/13/25-15:32:52  | Completed 1 iteration(s) in 761 ms | Average inference time: 761 ms.
[I] Accuracy Comparison | trt-runner-N0-03/13/25-15:32:52 vs. onnxrt-runner-N0-03/13/25-15:32:52
[I]     Comparing Output: 'output' (dtype=float32, shape=(1, 3, 512, 512)) with 'output' (dtype=float32, shape=(1, 3, 512, 512))
[I]         Tolerance: [abs=0.001, rel=0.001] | Checking elemwise error
[I]         trt-runner-N0-03/13/25-15:32:52: output | Stats: mean=0.33328, std-dev=0.47133, var=0.22215, median=8.0377e-07, min=1.84e-07 at (0, 1, 2, 6), max=0.99997 at (0, 0, 4, 6), avg-magnitude=0.33328, p90=0.99997, p95=0.99997, p99=0.99997
[I]             ---- Histogram ----
                Bin Range       |  Num Elems | Visualization
                (1.68e-07, 0.1) |     524288 | ########################################
                (0.1     , 0.2) |          0 |
                (0.2     , 0.3) |          0 |
                (0.3     , 0.4) |          0 |
                (0.4     , 0.5) |          0 |
                (0.5     , 0.6) |          0 |
                (0.6     , 0.7) |          0 |
                (0.7     , 0.8) |          0 |
                (0.8     , 0.9) |          0 |
                (0.9     , 1  ) |     262144 | ####################
[I]         onnxrt-runner-N0-03/13/25-15:32:52: output | Stats: mean=0.33328, std-dev=0.47133, var=0.22215, median=6.5396e-07, min=1.6816e-07 at (0, 1, 0, 6), max=0.99998 at (0, 0, 4, 5), avg-magnitude=0.33328, p90=0.99998, p95=0.99998, p99=0.99998
[I]             ---- Histogram ----
                Bin Range       |  Num Elems | Visualization
                (1.68e-07, 0.1) |     524288 | ########################################
                (0.1     , 0.2) |          0 |
                (0.2     , 0.3) |          0 |
                (0.3     , 0.4) |          0 |
                (0.4     , 0.5) |          0 |
                (0.5     , 0.6) |          0 |
                (0.6     , 0.7) |          0 |
                (0.7     , 0.8) |          0 |
                (0.8     , 0.9) |          0 |
                (0.9     , 1  ) |     262144 | ####################
[I]         Error Metrics: output
[I]             Minimum Required Tolerance: elemwise error | [abs=0.0016909] OR [rel=0.44752] (requirements may be lower if both abs/rel tolerances are set)
[I]             Absolute Difference | Stats: mean=4.8133e-06, std-dev=4.7414e-05, var=2.2481e-09, median=1.4981e-07, min=6.8212e-12 at (0, 1, 0, 433), max=0.0016909 at (0, 0, 0, 422), avg-magnitude=4.8133e-06, p90=7.1526e-06, p95=7.1526e-06, p99=7.5102e-06
[I]                 ---- Histogram ----
                    Bin Range            |  Num Elems | Visualization
                    (6.82e-12, 0.000169) |     784397 | ########################################
                    (0.000169, 0.000338) |        502 |
                    (0.000338, 0.000507) |        305 |
                    (0.000507, 0.000676) |        298 |
                    (0.000676, 0.000845) |         16 |
                    (0.000845, 0.00101 ) |         56 |
                    (0.00101 , 0.00118 ) |        208 |
                    (0.00118 , 0.00135 ) |        334 |
                    (0.00135 , 0.00152 ) |        178 |
                    (0.00152 , 0.00169 ) |        138 |
[I]             Relative Difference | Stats: mean=0.10717, std-dev=0.093952, var=0.008827, median=0.09343, min=1.193e-07 at (0, 0, 2, 436), max=0.44752 at (0, 1, 510, 510), avg-magnitude=0.10717, p90=0.22908, p95=0.22908, p99=0.22908
[I]                 ---- Histogram ----
                    Bin Range          |  Num Elems | Visualization
                    (1.19e-07, 0.0448) |     263497 | ########################################
                    (0.0448  , 0.0895) |       1455 |
                    (0.0895  , 0.134 ) |     259559 | #######################################
                    (0.134   , 0.179 ) |       1692 |
                    (0.179   , 0.224 ) |       1301 |
                    (0.224   , 0.269 ) |     258896 | #######################################
                    (0.269   , 0.313 ) |          6 |
                    (0.313   , 0.358 ) |          4 |
                    (0.358   , 0.403 ) |         10 |
                    (0.403   , 0.448 ) |         12 |
[E]         FAILED | Output: 'output' | Difference exceeds tolerance (rel=0.001, abs=0.001)
[E]     FAILED | Mismatched outputs: ['output']
[E] Accuracy Summary | trt-runner-N0-03/13/25-15:32:52 vs. onnxrt-runner-N0-03/13/25-15:32:52 | Passed: 0/1 iterations | Pass Rate: 0.0%
[E] FAILED | Runtime: 126.740s | Command: \\?\C:\Users\msrir\anaconda3\envs\onnx-opt\Scripts\polygraphy run model.onnx --trt --onnxrt --precision-constraints obey --save-engine model.engine --providers CUDAExecutionProvider --atol 1e-3 --rtol 1e-3
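Since the model is a segmentation model, one additional way to judge whether these differences matter is to compare the per-pixel predicted class rather than the raw values. The sketch below is only an illustration: it assumes the channel axis of the (1, 3, 512, 512) output holds per-class scores (which the log does not confirm), it works on plain nested lists shaped [C][H][W] for a single image, and the helper names and sample values are hypothetical.

```python
def argmax_class_map(scores):
    """Convert scores shaped [C][H][W] into an [H][W] map of class indices."""
    num_classes = len(scores)
    height = len(scores[0])
    width = len(scores[0][0])
    return [
        [max(range(num_classes), key=lambda c: scores[c][y][x]) for x in range(width)]
        for y in range(height)
    ]


def class_agreement(ref_scores, test_scores):
    """Fraction of pixels where both score tensors predict the same class."""
    ref_map = argmax_class_map(ref_scores)
    test_map = argmax_class_map(test_scores)
    total = sum(len(row) for row in ref_map)
    same = sum(
        r == t
        for ref_row, test_row in zip(ref_map, test_map)
        for r, t in zip(ref_row, test_row)
    )
    return same / total


# Hypothetical 3-class, 1x2 image. The test scores differ from the
# reference by about 1.7e-3, similar in size to the log's worst case.
ref = [[[0.99998, 1e-6]], [[1e-5, 0.99997]], [[1e-5, 2e-5]]]
test = [[[0.99829, 0.0017]], [[0.0017, 0.9983]], [[1e-5, 2e-5]]]

print(argmax_class_map(ref))       # [[0, 1]]
print(class_agreement(ref, test))  # 1.0
```

An agreement of 1.0 on real outputs would mean the predicted segmentation masks are identical even though the elementwise tolerance check fails; a lower value would point to pixels worth inspecting.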