Accuracy drop in resize op when converting from ONNX to TRT FP32

Description

There is an accuracy drop when converting a simple Resize op from ONNX to TensorRT (TRT).
I noticed a general performance drop in a few of my models and used Polygraphy to debug it.
I traced the root cause to a Resize node and reproduced the problem in a minimal example (see the relevant files and steps to reproduce in the corresponding sections).

The Resize op is defined in the ONNX model as follows:

==== ONNX Model ====
    Name: tf2onnx | ONNX Opset: 11 | Other Opsets: {'com.microsoft.nchwc': 1, 'ai.onnx.ml': 2, 'com.microsoft.mlfeaturizers': 1, 'com.microsoft': 1, 'ai.onnx.training': 1, 'ai.onnx.preview.training': 1}

    ---- Docstring ----
    converted from /home/jovyan/greeneye/models/detection/test_resize_model/resize_dummy/saved_model

    ---- 1 Graph Input(s) ----
    {input_1 [dtype=float32, shape=(1, 128, 12, 90)]}

    ---- 1 Graph Output(s) ----
    {tf_op_layer_ResizeNearestNeighbor [dtype=float32, shape=(1, 256, 24, 90)]}

    ---- 2 Initializer(s) ----
    {roi__6 [dtype=float32, shape=(0,)],
     Concat__16:0 [dtype=int64, shape=(4,)]}

    ---- 3 Node(s) ----
    Node 0    | Transpose__10 [Op: Transpose]
        {input_1 [dtype=float32, shape=(1, 128, 12, 90)]}
         -> {Transpose__10:0}
        ---- Attributes ----
        Transpose__10.perm = [0, 3, 1, 2]

    Node 1    | Resize__17 [Op: Resize]
        {Transpose__10:0,
         Initializer | roi__6 [dtype=float32, shape=(0,)],
         Initializer | roi__6 [dtype=float32, shape=(0,)],
         Initializer | Concat__16:0 [dtype=int64, shape=(4,)]}
         -> {Resize__17:0}
        ---- Attributes ----
        Resize__17.extrapolation_value = 0.0
        Resize__17.cubic_coeff_a = -0.75
        Resize__17.coordinate_transformation_mode = asymmetric
        Resize__17.exclude_outside = 0
        Resize__17.mode = nearest
        Resize__17.nearest_mode = floor

    Node 2    | PartitionedCall/functional_1/tf_op_layer_ResizeNearestNeighbor/ResizeNearestNeighbor [Op: Transpose]
        {Resize__17:0}
         -> {tf_op_layer_ResizeNearestNeighbor [dtype=float32, shape=(1, 256, 24, 90)]}
        ---- Attributes ----
        PartitionedCall/functional_1/tf_op_layer_ResizeNearestNeighbor/ResizeNearestNeighbor.perm = [0, 2, 3, 1]
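
For reference, the Resize attributes listed above (mode = nearest, coordinate_transformation_mode = asymmetric, nearest_mode = floor) correspond to a simple index rule in the ONNX specification: x_original = x_resized / scale, rounded down. The following is a minimal pure-Python sketch of that rule on a small 2-D array. The function names are hypothetical and chosen for illustration; this is not the ONNX Runtime or TensorRT implementation.

```python
import math

def nearest_src_index(x_out, scale, in_len):
    """Source index for one output position (asymmetric + floor)."""
    x_orig = x_out / scale               # asymmetric: x_original = x_resized / scale
    idx = math.floor(x_orig)             # nearest_mode = floor
    return min(max(idx, 0), in_len - 1)  # clamp to the valid input range

def resize_nearest_1d(row, out_len):
    """Resize a 1-D list to out_len elements."""
    scale = out_len / len(row)           # scale derived from the requested size
    return [row[nearest_src_index(x, scale, len(row))] for x in range(out_len)]

def resize_nearest_2d(img, out_h, out_w):
    """Apply the same rule along rows, then along columns."""
    in_h = len(img)
    scale_h = out_h / in_h
    rows = [img[nearest_src_index(y, scale_h, in_h)] for y in range(out_h)]
    return [resize_nearest_1d(r, out_w) for r in rows]

img = [[1, 2],
       [3, 4]]
out = resize_nearest_2d(img, 4, 4)
for r in out:
    print(r)
```

With a 2x upscale, as in this model (128x12 to 256x24 on the resized axes), every input element is simply repeated twice along each resized axis.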

When I run the following command:

polygraphy run model.onnx --workspace 1G --atol 1e-1 --rtol 1e-1 --trt --onnxrt --trt-outputs mark all --onnx-outputs mark all

I get the following accuracy issue:

[W] --workspace is deprecated and will be removed in Polygraphy 0.45.0. Use --pool-limit workspace:1G instead.
[I] RUNNING | Command: /usr/local/bin/polygraphy run /green/temp/resize_dummy_explicit_dims_batch_1.onnx --workspace 1G --atol 1e-1 --rtol 1e-1 --trt --onnxrt --trt-outputs mark all --onnx-outputs mark all
[I] trt-runner-N0-02/16/23-23:27:03     | Activating and starting inference
[TensorRT] WARNING: onnx2trt_utils.cpp:220: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[I]     Configuring with profiles: [Profile().add('input_1', min=[1, 128, 12, 90], opt=[1, 128, 12, 90], max=[1, 128, 12, 90])]
[W] It looks like some layers in the network have compute precision set, but precision constraints were not enabled.
    Precision constraints must be set to 'prefer' or 'obey' for layer compute precision to take effect.
    Note: Layers and their requested precisions were: {'(Unnamed Layer* 2) [Constant]': 'INT32'}
[I] Building engine with configuration:
    Flags                  | []
    DLA                    | Default Device Type: DeviceType.GPU, Core: 0
    Profiling Verbosity    | ProfilingVerbosity.VERBOSE
[I] Finished engine building in 1.738 seconds
[I] trt-runner-N0-02/16/23-23:27:03
    ---- Inference Input(s) ----
    {input_1 [dtype=float32, shape=(1, 128, 12, 90)]}
[I] trt-runner-N0-02/16/23-23:27:03
    ---- Inference Output(s) ----
    {Transpose__10:0 [dtype=float32, shape=(1, 90, 128, 12)],
     Resize__17:0 [dtype=float32, shape=(1, 90, 256, 24)],
     tf_op_layer_ResizeNearestNeighbor [dtype=float32, shape=(1, 256, 24, 90)]}
[I] trt-runner-N0-02/16/23-23:27:03     | Completed 1 iteration(s) in 8.731 ms | Average inference time: 8.731 ms.
[I] onnxrt-runner-N0-02/16/23-23:27:03  | Activating and starting inference
[I] Loading model: /green/temp/resize_dummy_explicit_dims_batch_1.onnx
[I] Creating ONNX-Runtime Inference Session with providers: ['CPUExecutionProvider']
[I] onnxrt-runner-N0-02/16/23-23:27:03
    ---- Inference Input(s) ----
    {input_1 [dtype=float32, shape=(1, 128, 12, 90)]}
[I] onnxrt-runner-N0-02/16/23-23:27:03
    ---- Inference Output(s) ----
    {Transpose__10:0 [dtype=float32, shape=(1, 90, 128, 12)],
     Resize__17:0 [dtype=float32, shape=(1, 90, 256, 24)],
     tf_op_layer_ResizeNearestNeighbor [dtype=float32, shape=(1, 256, 24, 90)]}
[I] onnxrt-runner-N0-02/16/23-23:27:03  | Completed 1 iteration(s) in 12.48 ms | Average inference time: 12.48 ms.
[I] Accuracy Comparison | trt-runner-N0-02/16/23-23:27:03 vs. onnxrt-runner-N0-02/16/23-23:27:03
[I]     Comparing Output: 'Transpose__10:0' (dtype=float32, shape=(1, 90, 128, 12)) with 'Transpose__10:0' (dtype=float32, shape=(1, 90, 128, 12))
[I]         Tolerance: [abs=0.1, rel=0.1] | Checking elemwise error
[I]         trt-runner-N0-02/16/23-23:27:03: Transpose__10:0 | Stats: mean=0.49967, std-dev=0.28929, var=0.083687, median=0.50057, min=1.0369e-05 at (0, 58, 43, 9), max=0.99999 at (0, 15, 35, 11), avg-magnitude=0.49967
[I]         onnxrt-runner-N0-02/16/23-23:27:03: Transpose__10:0 | Stats: mean=0.49967, std-dev=0.28929, var=0.083687, median=0.50057, min=1.0369e-05 at (0, 58, 43, 9), max=0.99999 at (0, 15, 35, 11), avg-magnitude=0.49967
[I]         Error Metrics: Transpose__10:0
[I]             Minimum Required Tolerance: elemwise error | [abs=0] OR [rel=0] (requirements may be lower if both abs/rel tolerances are set)
[I]             Absolute Difference | Stats: mean=0, std-dev=0, var=0, median=0, min=0 at (0, 0, 0, 0), max=0 at (0, 0, 0, 0), avg-magnitude=0
[I]             Relative Difference | Stats: mean=0, std-dev=0, var=0, median=0, min=0 at (0, 0, 0, 0), max=0 at (0, 0, 0, 0), avg-magnitude=0
[I]         PASSED | Output: 'Transpose__10:0' | Difference is within tolerance (rel=0.1, abs=0.1)
[I]     Comparing Output: 'Resize__17:0' (dtype=float32, shape=(1, 90, 256, 24)) with 'Resize__17:0' (dtype=float32, shape=(1, 90, 256, 24))
[I]         Tolerance: [abs=0.1, rel=0.1] | Checking elemwise error
[I]         trt-runner-N0-02/16/23-23:27:03: Resize__17:0 | Stats: mean=0.49975, std-dev=0.28933, var=0.083713, median=0.50058, min=1.0369e-05 at (0, 58, 87, 19), max=0.99999 at (0, 15, 71, 23), avg-magnitude=0.49975
[I]             ---- Histogram ----
                Bin Range       |  Num Elems | Visualization
                (1.04e-05, 0.1) |      56111 | ########################################
                (0.1     , 0.2) |      55311 | #######################################
                (0.2     , 0.3) |      55494 | #######################################
                (0.3     , 0.4) |      54449 | ######################################
                (0.4     , 0.5) |      54803 | #######################################
                (0.5     , 0.6) |      55559 | #######################################
                (0.6     , 0.7) |      55078 | #######################################
                (0.7     , 0.8) |      55530 | #######################################
                (0.8     , 0.9) |      55028 | #######################################
                (0.9     , 1  ) |      55597 | #######################################
[I]         onnxrt-runner-N0-02/16/23-23:27:03: Resize__17:0 | Stats: mean=0.49967, std-dev=0.28929, var=0.083687, median=0.50057, min=1.0369e-05 at (0, 58, 86, 18), max=0.99999 at (0, 15, 70, 22), avg-magnitude=0.49967
[I]             ---- Histogram ----
                Bin Range       |  Num Elems | Visualization
                (1.04e-05, 0.1) |      56056 | ########################################
                (0.1     , 0.2) |      55292 | #######################################
                (0.2     , 0.3) |      55628 | #######################################
                (0.3     , 0.4) |      54380 | ######################################
                (0.4     , 0.5) |      54828 | #######################################
                (0.5     , 0.6) |      55616 | #######################################
                (0.6     , 0.7) |      55144 | #######################################
                (0.7     , 0.8) |      55504 | #######################################
                (0.8     , 0.9) |      54932 | #######################################
                (0.9     , 1  ) |      55580 | #######################################
[I]         Error Metrics: Resize__17:0
[I]             Minimum Required Tolerance: elemwise error | [abs=0.9989] OR [rel=80209] (requirements may be lower if both abs/rel tolerances are set)
[I]             Absolute Difference | Stats: mean=0.24374, std-dev=0.25036, var=0.06268, median=0.17254, min=0 at (0, 0, 0, 0), max=0.9989 at (0, 38, 129, 4), avg-magnitude=0.24374
[I]                 ---- Histogram ----
                    Bin Range        |  Num Elems | Visualization
                    (0     , 0.0999) |     225868 | ########################################
                    (0.0999, 0.2   ) |      68455 | ############
                    (0.2   , 0.3   ) |      60353 | ##########
                    (0.3   , 0.4   ) |      52322 | #########
                    (0.4   , 0.499 ) |      44280 | #######
                    (0.499 , 0.599 ) |      36490 | ######
                    (0.599 , 0.699 ) |      28309 | #####
                    (0.699 , 0.799 ) |      20511 | ###
                    (0.799 , 0.899 ) |      12273 | ##
                    (0.899 , 0.999 ) |       4099 |
[I]             Relative Difference | Stats: mean=4.3466, std-dev=211.49, var=44727, median=0.36421, min=0 at (0, 0, 0, 0), max=80209 at (0, 79, 89, 16), avg-magnitude=4.3466
[I]                 ---- Histogram ----
                    Bin Range            |  Num Elems | Visualization
                    (0       , 8.02e+03) |     552930 | ########################################
                    (8.02e+03, 1.6e+04 ) |         15 |
                    (1.6e+04 , 2.41e+04) |          6 |
                    (2.41e+04, 3.21e+04) |          2 |
                    (3.21e+04, 4.01e+04) |          1 |
                    (4.01e+04, 4.81e+04) |          4 |
                    (4.81e+04, 5.61e+04) |          1 |
                    (5.61e+04, 6.42e+04) |          0 |
                    (6.42e+04, 7.22e+04) |          0 |
                    (7.22e+04, 8.02e+04) |          1 |
[E]         FAILED | Output: 'Resize__17:0' | Difference exceeds tolerance (rel=0.1, abs=0.1)
[I]     Comparing Output: 'tf_op_layer_ResizeNearestNeighbor' (dtype=float32, shape=(1, 256, 24, 90)) with 'tf_op_layer_ResizeNearestNeighbor' (dtype=float32, shape=(1, 256, 24, 90))
[I]         Tolerance: [abs=0.1, rel=0.1] | Checking elemwise error
[I]         trt-runner-N0-02/16/23-23:27:03: tf_op_layer_ResizeNearestNeighbor | Stats: mean=0.49975, std-dev=0.28933, var=0.083713, median=0.50058, min=1.0369e-05 at (0, 87, 19, 58), max=0.99999 at (0, 71, 23, 15), avg-magnitude=0.49975
[I]             ---- Histogram ----
                Bin Range       |  Num Elems | Visualization
                (1.04e-05, 0.1) |      56111 | ########################################
                (0.1     , 0.2) |      55311 | #######################################
                (0.2     , 0.3) |      55494 | #######################################
                (0.3     , 0.4) |      54449 | ######################################
                (0.4     , 0.5) |      54803 | #######################################
                (0.5     , 0.6) |      55559 | #######################################
                (0.6     , 0.7) |      55078 | #######################################
                (0.7     , 0.8) |      55530 | #######################################
                (0.8     , 0.9) |      55028 | #######################################
                (0.9     , 1  ) |      55597 | #######################################
[I]         onnxrt-runner-N0-02/16/23-23:27:03: tf_op_layer_ResizeNearestNeighbor | Stats: mean=0.49967, std-dev=0.28929, var=0.083687, median=0.50057, min=1.0369e-05 at (0, 86, 18, 58), max=0.99999 at (0, 70, 22, 15), avg-magnitude=0.49967
[I]             ---- Histogram ----
                Bin Range       |  Num Elems | Visualization
                (1.04e-05, 0.1) |      56056 | ########################################
                (0.1     , 0.2) |      55292 | #######################################
                (0.2     , 0.3) |      55628 | #######################################
                (0.3     , 0.4) |      54380 | ######################################
                (0.4     , 0.5) |      54828 | #######################################
                (0.5     , 0.6) |      55616 | #######################################
                (0.6     , 0.7) |      55144 | #######################################
                (0.7     , 0.8) |      55504 | #######################################
                (0.8     , 0.9) |      54932 | #######################################
                (0.9     , 1  ) |      55580 | #######################################
[I]         Error Metrics: tf_op_layer_ResizeNearestNeighbor
[I]             Minimum Required Tolerance: elemwise error | [abs=0.9989] OR [rel=80209] (requirements may be lower if both abs/rel tolerances are set)
[I]             Absolute Difference | Stats: mean=0.24374, std-dev=0.25036, var=0.06268, median=0.17254, min=0 at (0, 0, 0, 0), max=0.9989 at (0, 129, 4, 38), avg-magnitude=0.24374
[I]                 ---- Histogram ----
                    Bin Range        |  Num Elems | Visualization
                    (0     , 0.0999) |     225868 | ########################################
                    (0.0999, 0.2   ) |      68455 | ############
                    (0.2   , 0.3   ) |      60353 | ##########
                    (0.3   , 0.4   ) |      52322 | #########
                    (0.4   , 0.499 ) |      44280 | #######
                    (0.499 , 0.599 ) |      36490 | ######
                    (0.599 , 0.699 ) |      28309 | #####
                    (0.699 , 0.799 ) |      20511 | ###
                    (0.799 , 0.899 ) |      12273 | ##
                    (0.899 , 0.999 ) |       4099 |
[I]             Relative Difference | Stats: mean=4.3466, std-dev=211.49, var=44727, median=0.36421, min=0 at (0, 0, 0, 0), max=80209 at (0, 89, 16, 79), avg-magnitude=4.3466
[I]                 ---- Histogram ----
                    Bin Range            |  Num Elems | Visualization
                    (0       , 8.02e+03) |     552930 | ########################################
                    (8.02e+03, 1.6e+04 ) |         15 |
                    (1.6e+04 , 2.41e+04) |          6 |
                    (2.41e+04, 3.21e+04) |          2 |
                    (3.21e+04, 4.01e+04) |          1 |
                    (4.01e+04, 4.81e+04) |          4 |
                    (4.81e+04, 5.61e+04) |          1 |
                    (5.61e+04, 6.42e+04) |          0 |
                    (6.42e+04, 7.22e+04) |          0 |
                    (7.22e+04, 8.02e+04) |          1 |
[E]         FAILED | Output: 'tf_op_layer_ResizeNearestNeighbor' | Difference exceeds tolerance (rel=0.1, abs=0.1)
[E]     FAILED | Mismatched outputs: ['Resize__17:0', 'tf_op_layer_ResizeNearestNeighbor']
[E] Accuracy Summary | trt-runner-N0-02/16/23-23:27:03 vs. onnxrt-runner-N0-02/16/23-23:27:03 | Passed: 0/1 iterations | Pass Rate: 0.0%
[E] FAILED | Runtime: 4.653s | Command: /usr/local/bin/polygraphy run /green/temp/resize_dummy_explicit_dims_batch_1.onnx --workspace 1G --atol 1e-1 --rtol 1e-1 --trt --onnxrt --trt-outputs mark all --onnx-outputs mark all

Also note the high tolerances I chose for this comparison (1e-1); the default is 1e-5.
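
To make the reported error metrics easier to read, here is a minimal sketch of an element-wise tolerance check. It assumes an element passes when either its absolute or its relative difference is within tolerance, which is how the "[abs=...] OR [rel=...]" wording in the log reads; it is not Polygraphy's actual implementation, and the function name is hypothetical. It also shows why the relative difference can reach values such as 80209: the reference tensor contains values close to zero (the minimum in the log is 1.0369e-05), so dividing by them inflates the ratio.

```python
def elemwise_check(out, ref, atol=1e-5, rtol=1e-5):
    """Return (passed, max_abs_diff, max_rel_diff) for two flat lists."""
    abs_diff = [abs(a - b) for a, b in zip(out, ref)]
    rel_diff = []
    for d, b in zip(abs_diff, ref):
        if b != 0:
            rel_diff.append(d / abs(b))  # relative to the reference value
        else:
            rel_diff.append(0.0 if d == 0 else float("inf"))
    # Assumed rule: an element is fine if EITHER tolerance is satisfied.
    passed = all(d <= atol or r <= rtol for d, r in zip(abs_diff, rel_diff))
    return passed, max(abs_diff), max(rel_diff)

# Made-up values: the second reference element is near zero.
ref = [0.5, 1e-5]
out = [0.5, 0.9]
passed, max_abs, max_rel = elemwise_check(out, ref, atol=1e-1, rtol=1e-1)
print(passed, max_abs, max_rel)
```

Even with tolerances as loose as 1e-1, a single element that lands on the wrong source pixel is enough to fail the check.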

I believe the issue is in how the TRT op is built from the attributes of the ONNX op, which causes a different mathematical operation to be performed and therefore a different output.
Since the actual implementation of the TRT op isn't available, I hope you can find the bug and fix it.
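
One observation from the log: for Resize__17:0, the minimum is reported at (0, 58, 87, 19) by TRT but at (0, 58, 86, 18) by ONNX Runtime, and the maximum at (0, 15, 71, 23) versus (0, 15, 70, 22), so the positions differ by exactly one index along both resized axes. The sketch below illustrates how a different coordinate rule alone could produce this kind of shift. It compares the asymmetric rule from the ONNX model with the half_pixel rule from the ONNX specification, both with floor rounding. Choosing half_pixel here is an assumption made purely for illustration; it is not a confirmed description of what TensorRT 7.1.3 does.

```python
import math

def asymmetric(x_out, scale):
    # ONNX spec: x_original = x_resized / scale
    return x_out / scale

def half_pixel(x_out, scale):
    # ONNX spec: x_original = (x_resized + 0.5) / scale - 0.5
    return (x_out + 0.5) / scale - 0.5

def source_indices(out_len, in_len, transform):
    """Source index chosen for each output position (nearest_mode = floor)."""
    scale = out_len / in_len
    idx = []
    for x in range(out_len):
        i = math.floor(transform(x, scale))
        idx.append(min(max(i, 0), in_len - 1))  # clamp to the valid range
    return idx

expected = source_indices(8, 4, asymmetric)  # rule requested by the ONNX model
shifted = source_indices(8, 4, half_pixel)   # hypothetical alternative rule
mismatched = [x for x in range(8) if expected[x] != shifted[x]]
print(expected)
print(shifted)
print(mismatched)
```

Under this hypothetical rule, roughly every second output position along a resized axis reads a neighbouring input element, which would explain large element-wise differences on random input while the overall statistics (mean, standard deviation) stay almost identical.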

I would also appreciate a workaround until it is solved.

Thanks in advance

Environment

TensorRT Version : 7.1.3
NVIDIA GPU : Jetson AGX Xavier
NVIDIA Driver Version : l4t 32.5.1
CUDA Version : 10.2
CUDNN Version :
ONNX runtime : onnxruntime 1.10.0
ONNX opset : 11
Operating System : Ubuntu 18.04.5 LTS
Python Version (if applicable) : 3.6.9
Tensorflow Version (if applicable) : 2.3 (original model was created with TF, then converted to ONNX)
PyTorch Version (if applicable) :
Baremetal or Container (if so, version) : custom container

Relevant files to reproduce

Link to an ONNX model that has the issue when converted to TRT

Hi,

We recommend trying the latest TensorRT version, 8.5, and letting us know if you still face this issue.
If you’re interested, you can also try the TensorRT NGC container to avoid setup-related issues.

Thank you.

Hey @spolisetty and thank you for the quick response.

Unfortunately, I don’t have any option to upgrade to the latest JetPack at this time.
Is there a way to upgrade my TRT version without upgrading JetPack?
Do you know the root cause of the issue? What has changed between my version and version 8.5 for the Resize op? Which “minimal” version of TRT solves the issue?

Hi @weissrael ,
Are you still facing the issue?

Hi @AakankshaS
Yes, basically.
I found a possible workaround for ONNX <-> TRT compatibility on TRT version 7.1: changing the resize interpolation method from “nearest_neighbor” to “bilinear”. The problem is that this workaround requires additional fine-tuning of the model; otherwise its predictions are not usable.
That takes too much time, so I’d rather have a solution that doesn’t require more fine-tuning.
If there is a way to upgrade TRT without upgrading JetPack, so that I can keep my original model without training it further, that would be ideal, both for me and for other people who might have this issue.
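
A direction that would not change the model's outputs, sketched here under assumptions: for an integer scale factor such as the 2x used in this model, nearest-neighbor resizing with asymmetric coordinates and floor rounding is the same as repeating each element along the resized axes. The pure-Python check below demonstrates that equivalence on a small array; the function names are hypothetical. Whether expressing the upsampling through repetition (instead of a Resize node) actually avoids the mismatch on TRT 7.1.3 is untested.

```python
import math

def upsample_by_repeat(img, factor_h, factor_w):
    """Upsample a 2-D list by repeating each element factor_w times per row
    and each row factor_h times."""
    out = []
    for row in img:
        wide = [v for v in row for _ in range(factor_w)]
        out.extend(list(wide) for _ in range(factor_h))
    return out

def resize_reference(img, factor_h, factor_w):
    """Nearest resize with asymmetric coordinates and floor rounding."""
    in_h, in_w = len(img), len(img[0])
    return [[img[math.floor(y / factor_h)][math.floor(x / factor_w)]
             for x in range(in_w * factor_w)]
            for y in range(in_h * factor_h)]

img = [[1, 2, 3],
       [4, 5, 6]]
repeated = upsample_by_repeat(img, 2, 2)
reference = resize_reference(img, 2, 2)
print(repeated == reference)
```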

@weissrael, did you find a solution for this? I am currently facing something similar.