Accuracy drop in resize op when converting from ONNX to TRT FP32

weissrael · February 17, 2023, 7:43am

Description

There is an accuracy drop when converting a simple Resize op from ONNX to TensorRT (TRT).
I noticed a general accuracy drop in a few of my models and used Polygraphy to debug it.
I traced the root cause to a Resize node and reproduced the problem in a minimal example (see the relevant files and steps to reproduce in the appropriate sections).

The Resize op is defined in the ONNX model as follows:

==== ONNX Model ====
    Name: tf2onnx | ONNX Opset: 11 | Other Opsets: {'com.microsoft.nchwc': 1, 'ai.onnx.ml': 2, 'com.microsoft.mlfeaturizers': 1, 'com.microsoft': 1, 'ai.onnx.training': 1, 'ai.onnx.preview.training': 1}

    ---- Docstring ----
    converted from /home/jovyan/greeneye/models/detection/test_resize_model/resize_dummy/saved_model

    ---- 1 Graph Input(s) ----
    {input_1 [dtype=float32, shape=(1, 128, 12, 90)]}

    ---- 1 Graph Output(s) ----
    {tf_op_layer_ResizeNearestNeighbor [dtype=float32, shape=(1, 256, 24, 90)]}

    ---- 2 Initializer(s) ----
    {roi__6 [dtype=float32, shape=(0,)],
     Concat__16:0 [dtype=int64, shape=(4,)]}

    ---- 3 Node(s) ----
    Node 0    | Transpose__10 [Op: Transpose]
        {input_1 [dtype=float32, shape=(1, 128, 12, 90)]}
         -> {Transpose__10:0}
        ---- Attributes ----
        Transpose__10.perm = [0, 3, 1, 2]

    Node 1    | Resize__17 [Op: Resize]
        {Transpose__10:0,
         Initializer | roi__6 [dtype=float32, shape=(0,)],
         Initializer | roi__6 [dtype=float32, shape=(0,)],
         Initializer | Concat__16:0 [dtype=int64, shape=(4,)]}
         -> {Resize__17:0}
        ---- Attributes ----
        Resize__17.extrapolation_value = 0.0
        Resize__17.cubic_coeff_a = -0.75
        Resize__17.coordinate_transformation_mode = asymmetric
        Resize__17.exclude_outside = 0
        Resize__17.mode = nearest
        Resize__17.nearest_mode = floor

    Node 2    | PartitionedCall/functional_1/tf_op_layer_ResizeNearestNeighbor/ResizeNearestNeighbor [Op: Transpose]
        {Resize__17:0}
         -> {tf_op_layer_ResizeNearestNeighbor [dtype=float32, shape=(1, 256, 24, 90)]}
        ---- Attributes ----
        PartitionedCall/functional_1/tf_op_layer_ResizeNearestNeighbor/ResizeNearestNeighbor.perm = [0, 2, 3, 1]
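For reference, the attributes above (mode = nearest, coordinate_transformation_mode = asymmetric, nearest_mode = floor) correspond, per the ONNX Resize definition, to picking the source index floor(out_idx / scale) along each resized axis. The following is a minimal pure-Python sketch of that mapping on a 2D grid. The function names are hypothetical, and it only illustrates the expected ONNX semantics; it says nothing about how TRT implements the layer.

```python
def nearest_floor_index(out_idx, in_size, out_size):
    """Source index for asymmetric coordinates with nearest_mode=floor.

    Equivalent to floor(out_idx / scale) with scale = out_size / in_size.
    Integer arithmetic is used to avoid floating-point rounding.
    """
    return min((out_idx * in_size) // out_size, in_size - 1)


def resize_nearest_2d(image, out_h, out_w):
    """Nearest-neighbor resize of a 2D list of rows (hypothetical helper)."""
    in_h, in_w = len(image), len(image[0])
    return [
        [
            image[nearest_floor_index(y, in_h, out_h)][nearest_floor_index(x, in_w, out_w)]
            for x in range(out_w)
        ]
        for y in range(out_h)
    ]


if __name__ == "__main__":
    small = [[1, 2], [3, 4]]
    # Upscale 2x2 -> 4x4, the same 2x factor as 128 -> 256 and 12 -> 24 in the model.
    for row in resize_nearest_2d(small, 4, 4):
        print(row)
```

With a 2x scale, output indices 2k and 2k + 1 both read input index k, so a correct result is every input element repeated in a 2x2 block.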

When I run polygraphy run model.onnx --workspace 1G --atol 1e-1 --rtol 1e-1 --trt --onnxrt --trt-outputs mark all --onnx-outputs mark all, I get the following accuracy failure:

[W] --workspace is deprecated and will be removed in Polygraphy 0.45.0. Use --pool-limit workspace:1G instead.
[I] RUNNING | Command: /usr/local/bin/polygraphy run /green/temp/resize_dummy_explicit_dims_batch_1.onnx --workspace 1G --atol 1e-1 --rtol 1e-1 --trt --onnxrt --trt-outputs mark all --onnx-outputs mark all
[I] trt-runner-N0-02/16/23-23:27:03     | Activating and starting inference
[TensorRT] WARNING: onnx2trt_utils.cpp:220: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[I]     Configuring with profiles: [Profile().add('input_1', min=[1, 128, 12, 90], opt=[1, 128, 12, 90], max=[1, 128, 12, 90])]
[W] It looks like some layers in the network have compute precision set, but precision constraints were not enabled.
    Precision constraints must be set to 'prefer' or 'obey' for layer compute precision to take effect.
    Note: Layers and their requested precisions were: {'(Unnamed Layer* 2) [Constant]': 'INT32'}
[I] Building engine with configuration:
    Flags                  | []
    DLA                    | Default Device Type: DeviceType.GPU, Core: 0
    Profiling Verbosity    | ProfilingVerbosity.VERBOSE
[I] Finished engine building in 1.738 seconds
[I] trt-runner-N0-02/16/23-23:27:03
    ---- Inference Input(s) ----
    {input_1 [dtype=float32, shape=(1, 128, 12, 90)]}
[I] trt-runner-N0-02/16/23-23:27:03
    ---- Inference Output(s) ----
    {Transpose__10:0 [dtype=float32, shape=(1, 90, 128, 12)],
     Resize__17:0 [dtype=float32, shape=(1, 90, 256, 24)],
     tf_op_layer_ResizeNearestNeighbor [dtype=float32, shape=(1, 256, 24, 90)]}
[I] trt-runner-N0-02/16/23-23:27:03     | Completed 1 iteration(s) in 8.731 ms | Average inference time: 8.731 ms.
[I] onnxrt-runner-N0-02/16/23-23:27:03  | Activating and starting inference
[I] Loading model: /green/temp/resize_dummy_explicit_dims_batch_1.onnx
[I] Creating ONNX-Runtime Inference Session with providers: ['CPUExecutionProvider']
[I] onnxrt-runner-N0-02/16/23-23:27:03
    ---- Inference Input(s) ----
    {input_1 [dtype=float32, shape=(1, 128, 12, 90)]}
[I] onnxrt-runner-N0-02/16/23-23:27:03
    ---- Inference Output(s) ----
    {Transpose__10:0 [dtype=float32, shape=(1, 90, 128, 12)],
     Resize__17:0 [dtype=float32, shape=(1, 90, 256, 24)],
     tf_op_layer_ResizeNearestNeighbor [dtype=float32, shape=(1, 256, 24, 90)]}
[I] onnxrt-runner-N0-02/16/23-23:27:03  | Completed 1 iteration(s) in 12.48 ms | Average inference time: 12.48 ms.
[I] Accuracy Comparison | trt-runner-N0-02/16/23-23:27:03 vs. onnxrt-runner-N0-02/16/23-23:27:03
[I]     Comparing Output: 'Transpose__10:0' (dtype=float32, shape=(1, 90, 128, 12)) with 'Transpose__10:0' (dtype=float32, shape=(1, 90, 128, 12))
[I]         Tolerance: [abs=0.1, rel=0.1] | Checking elemwise error
[I]         trt-runner-N0-02/16/23-23:27:03: Transpose__10:0 | Stats: mean=0.49967, std-dev=0.28929, var=0.083687, median=0.50057, min=1.0369e-05 at (0, 58, 43, 9), max=0.99999 at (0, 15, 35, 11), avg-magnitude=0.49967
[I]         onnxrt-runner-N0-02/16/23-23:27:03: Transpose__10:0 | Stats: mean=0.49967, std-dev=0.28929, var=0.083687, median=0.50057, min=1.0369e-05 at (0, 58, 43, 9), max=0.99999 at (0, 15, 35, 11), avg-magnitude=0.49967
[I]         Error Metrics: Transpose__10:0
[I]             Minimum Required Tolerance: elemwise error | [abs=0] OR [rel=0] (requirements may be lower if both abs/rel tolerances are set)
[I]             Absolute Difference | Stats: mean=0, std-dev=0, var=0, median=0, min=0 at (0, 0, 0, 0), max=0 at (0, 0, 0, 0), avg-magnitude=0
[I]             Relative Difference | Stats: mean=0, std-dev=0, var=0, median=0, min=0 at (0, 0, 0, 0), max=0 at (0, 0, 0, 0), avg-magnitude=0
[I]         PASSED | Output: 'Transpose__10:0' | Difference is within tolerance (rel=0.1, abs=0.1)
[I]     Comparing Output: 'Resize__17:0' (dtype=float32, shape=(1, 90, 256, 24)) with 'Resize__17:0' (dtype=float32, shape=(1, 90, 256, 24))
[I]         Tolerance: [abs=0.1, rel=0.1] | Checking elemwise error
[I]         trt-runner-N0-02/16/23-23:27:03: Resize__17:0 | Stats: mean=0.49975, std-dev=0.28933, var=0.083713, median=0.50058, min=1.0369e-05 at (0, 58, 87, 19), max=0.99999 at (0, 15, 71, 23), avg-magnitude=0.49975
[I]             ---- Histogram ----
                Bin Range       |  Num Elems | Visualization
                (1.04e-05, 0.1) |      56111 | ########################################
                (0.1     , 0.2) |      55311 | #######################################
                (0.2     , 0.3) |      55494 | #######################################
                (0.3     , 0.4) |      54449 | ######################################
                (0.4     , 0.5) |      54803 | #######################################
                (0.5     , 0.6) |      55559 | #######################################
                (0.6     , 0.7) |      55078 | #######################################
                (0.7     , 0.8) |      55530 | #######################################
                (0.8     , 0.9) |      55028 | #######################################
                (0.9     , 1  ) |      55597 | #######################################
[I]         onnxrt-runner-N0-02/16/23-23:27:03: Resize__17:0 | Stats: mean=0.49967, std-dev=0.28929, var=0.083687, median=0.50057, min=1.0369e-05 at (0, 58, 86, 18), max=0.99999 at (0, 15, 70, 22), avg-magnitude=0.49967
[I]             ---- Histogram ----
                Bin Range       |  Num Elems | Visualization
                (1.04e-05, 0.1) |      56056 | ########################################
                (0.1     , 0.2) |      55292 | #######################################
                (0.2     , 0.3) |      55628 | #######################################
                (0.3     , 0.4) |      54380 | ######################################
                (0.4     , 0.5) |      54828 | #######################################
                (0.5     , 0.6) |      55616 | #######################################
                (0.6     , 0.7) |      55144 | #######################################
                (0.7     , 0.8) |      55504 | #######################################
                (0.8     , 0.9) |      54932 | #######################################
                (0.9     , 1  ) |      55580 | #######################################
[I]         Error Metrics: Resize__17:0
[I]             Minimum Required Tolerance: elemwise error | [abs=0.9989] OR [rel=80209] (requirements may be lower if both abs/rel tolerances are set)
[I]             Absolute Difference | Stats: mean=0.24374, std-dev=0.25036, var=0.06268, median=0.17254, min=0 at (0, 0, 0, 0), max=0.9989 at (0, 38, 129, 4), avg-magnitude=0.24374
[I]                 ---- Histogram ----
                    Bin Range        |  Num Elems | Visualization
                    (0     , 0.0999) |     225868 | ########################################
                    (0.0999, 0.2   ) |      68455 | ############
                    (0.2   , 0.3   ) |      60353 | ##########
                    (0.3   , 0.4   ) |      52322 | #########
                    (0.4   , 0.499 ) |      44280 | #######
                    (0.499 , 0.599 ) |      36490 | ######
                    (0.599 , 0.699 ) |      28309 | #####
                    (0.699 , 0.799 ) |      20511 | ###
                    (0.799 , 0.899 ) |      12273 | ##
                    (0.899 , 0.999 ) |       4099 |
[I]             Relative Difference | Stats: mean=4.3466, std-dev=211.49, var=44727, median=0.36421, min=0 at (0, 0, 0, 0), max=80209 at (0, 79, 89, 16), avg-magnitude=4.3466
[I]                 ---- Histogram ----
                    Bin Range            |  Num Elems | Visualization
                    (0       , 8.02e+03) |     552930 | ########################################
                    (8.02e+03, 1.6e+04 ) |         15 |
                    (1.6e+04 , 2.41e+04) |          6 |
                    (2.41e+04, 3.21e+04) |          2 |
                    (3.21e+04, 4.01e+04) |          1 |
                    (4.01e+04, 4.81e+04) |          4 |
                    (4.81e+04, 5.61e+04) |          1 |
                    (5.61e+04, 6.42e+04) |          0 |
                    (6.42e+04, 7.22e+04) |          0 |
                    (7.22e+04, 8.02e+04) |          1 |
[E]         FAILED | Output: 'Resize__17:0' | Difference exceeds tolerance (rel=0.1, abs=0.1)
[I]     Comparing Output: 'tf_op_layer_ResizeNearestNeighbor' (dtype=float32, shape=(1, 256, 24, 90)) with 'tf_op_layer_ResizeNearestNeighbor' (dtype=float32, shape=(1, 256, 24, 90))
[I]         Tolerance: [abs=0.1, rel=0.1] | Checking elemwise error
[I]         trt-runner-N0-02/16/23-23:27:03: tf_op_layer_ResizeNearestNeighbor | Stats: mean=0.49975, std-dev=0.28933, var=0.083713, median=0.50058, min=1.0369e-05 at (0, 87, 19, 58), max=0.99999 at (0, 71, 23, 15), avg-magnitude=0.49975
[I]             ---- Histogram ----
                Bin Range       |  Num Elems | Visualization
                (1.04e-05, 0.1) |      56111 | ########################################
                (0.1     , 0.2) |      55311 | #######################################
                (0.2     , 0.3) |      55494 | #######################################
                (0.3     , 0.4) |      54449 | ######################################
                (0.4     , 0.5) |      54803 | #######################################
                (0.5     , 0.6) |      55559 | #######################################
                (0.6     , 0.7) |      55078 | #######################################
                (0.7     , 0.8) |      55530 | #######################################
                (0.8     , 0.9) |      55028 | #######################################
                (0.9     , 1  ) |      55597 | #######################################
[I]         onnxrt-runner-N0-02/16/23-23:27:03: tf_op_layer_ResizeNearestNeighbor | Stats: mean=0.49967, std-dev=0.28929, var=0.083687, median=0.50057, min=1.0369e-05 at (0, 86, 18, 58), max=0.99999 at (0, 70, 22, 15), avg-magnitude=0.49967
[I]             ---- Histogram ----
                Bin Range       |  Num Elems | Visualization
                (1.04e-05, 0.1) |      56056 | ########################################
                (0.1     , 0.2) |      55292 | #######################################
                (0.2     , 0.3) |      55628 | #######################################
                (0.3     , 0.4) |      54380 | ######################################
                (0.4     , 0.5) |      54828 | #######################################
                (0.5     , 0.6) |      55616 | #######################################
                (0.6     , 0.7) |      55144 | #######################################
                (0.7     , 0.8) |      55504 | #######################################
                (0.8     , 0.9) |      54932 | #######################################
                (0.9     , 1  ) |      55580 | #######################################
[I]         Error Metrics: tf_op_layer_ResizeNearestNeighbor
[I]             Minimum Required Tolerance: elemwise error | [abs=0.9989] OR [rel=80209] (requirements may be lower if both abs/rel tolerances are set)
[I]             Absolute Difference | Stats: mean=0.24374, std-dev=0.25036, var=0.06268, median=0.17254, min=0 at (0, 0, 0, 0), max=0.9989 at (0, 129, 4, 38), avg-magnitude=0.24374
[I]                 ---- Histogram ----
                    Bin Range        |  Num Elems | Visualization
                    (0     , 0.0999) |     225868 | ########################################
                    (0.0999, 0.2   ) |      68455 | ############
                    (0.2   , 0.3   ) |      60353 | ##########
                    (0.3   , 0.4   ) |      52322 | #########
                    (0.4   , 0.499 ) |      44280 | #######
                    (0.499 , 0.599 ) |      36490 | ######
                    (0.599 , 0.699 ) |      28309 | #####
                    (0.699 , 0.799 ) |      20511 | ###
                    (0.799 , 0.899 ) |      12273 | ##
                    (0.899 , 0.999 ) |       4099 |
[I]             Relative Difference | Stats: mean=4.3466, std-dev=211.49, var=44727, median=0.36421, min=0 at (0, 0, 0, 0), max=80209 at (0, 89, 16, 79), avg-magnitude=4.3466
[I]                 ---- Histogram ----
                    Bin Range            |  Num Elems | Visualization
                    (0       , 8.02e+03) |     552930 | ########################################
                    (8.02e+03, 1.6e+04 ) |         15 |
                    (1.6e+04 , 2.41e+04) |          6 |
                    (2.41e+04, 3.21e+04) |          2 |
                    (3.21e+04, 4.01e+04) |          1 |
                    (4.01e+04, 4.81e+04) |          4 |
                    (4.81e+04, 5.61e+04) |          1 |
                    (5.61e+04, 6.42e+04) |          0 |
                    (6.42e+04, 7.22e+04) |          0 |
                    (7.22e+04, 8.02e+04) |          1 |
[E]         FAILED | Output: 'tf_op_layer_ResizeNearestNeighbor' | Difference exceeds tolerance (rel=0.1, abs=0.1)
[E]     FAILED | Mismatched outputs: ['Resize__17:0', 'tf_op_layer_ResizeNearestNeighbor']
[E] Accuracy Summary | trt-runner-N0-02/16/23-23:27:03 vs. onnxrt-runner-N0-02/16/23-23:27:03 | Passed: 0/1 iterations | Pass Rate: 0.0%
[E] FAILED | Runtime: 4.653s | Command: /usr/local/bin/polygraphy run /green/temp/resize_dummy_explicit_dims_batch_1.onnx --workspace 1G --atol 1e-1 --rtol 1e-1 --trt --onnxrt --trt-outputs mark all --onnx-outputs mark all

Also note the high tolerances I set for the comparison (--atol 1e-1 and --rtol 1e-1); the default is 1e-5.
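The very large relative error in the log (rel=80209) is most likely a consequence of reference values close to zero: the tensors contain values between roughly 1e-5 and 1, so a mismatched element compared against a near-zero reference divides a difference of order 1 by something of order 1e-5. A minimal sketch with made-up numbers follows. The helper name and the convention of dividing by the reference value are assumptions for illustration, not Polygraphy's actual implementation.

```python
def max_errors(actual, reference):
    """Return (max absolute difference, max relative difference).

    The relative difference is taken with respect to the reference value;
    this convention is an assumption made for the sketch.
    """
    abs_diffs = [abs(a - r) for a, r in zip(actual, reference)]
    rel_diffs = [d / abs(r) for d, r in zip(abs_diffs, reference) if r != 0]
    return max(abs_diffs), max(rel_diffs)


# Made-up values in [0, 1); the last reference element is close to zero.
reference = [0.5, 0.9, 1e-5]
actual = [0.5, 0.1, 0.8]

max_abs, max_rel = max_errors(actual, reference)
print(f"max abs diff: {max_abs:.4f}")
print(f"max rel diff: {max_rel:.0f}")
```

The absolute difference is therefore the more informative figure here: a mean of 0.24 on values in [0, 1) points to elements being read from the wrong positions rather than to rounding noise.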

I believe the issue lies in how the TRT layer is built from the attributes of the ONNX op: a different computation ends up being performed, which produces the difference in the output.
Since the implementation of the TRT op isn’t available, I hope you can find the bug and fix it.

I would also appreciate a workaround until it is solved.

Thanks in advance

Environment

TensorRT Version : 7.1.3
NVIDIA GPU : Jetson AGX Xavier
NVIDIA Driver Version : l4t 32.5.1
CUDA Version : 10.2
CUDNN Version :
ONNX runtime : onnxruntime 1.10.0
ONNX opset : 11
Operating System : Ubuntu 18.04.5 LTS
Python Version (if applicable) : 3.6.9
Tensorflow Version (if applicable) : 2.3 (original model was created with TF, then converted to ONNX)
PyTorch Version (if applicable) :
Baremetal or Container (if so, version) : custom container

Relevant files to reproduce

Link to an ONNX model that shows the issue when converted to TRT

spolisetty · February 17, 2023, 10:54am

Hi,

We recommend that you please try the latest TensorRT version 8.5 and let us know if you still face this issue.
If you’re interested, you can also try the TensorRT NGC container to avoid setup related issues.

Thank you.

weissrael · February 19, 2023, 11:03am

Hey @spolisetty and thank you for the quick response.

Unfortunately, I can’t upgrade to the latest JetPack at this time.
Is there a way to upgrade my TRT version without upgrading JetPack?
Do you know the root cause of the issue? What changed between my version and version 8.5 for the Resize op? What is the minimal TRT version that fixes the issue?

AakankshaS · April 29, 2023, 6:50am

Hi @weissrael ,
Are you still facing the issue?

weissrael · May 3, 2023, 1:39pm

Hi @AakankshaS
Yes, basically.
I found a possible workaround for ONNX-to-TRT compatibility on TRT 7.1: changing the resize interpolation method to “bilinear” instead of “nearest_neighbor”. The problem is that this workaround requires further fine-tuning of the model; otherwise its predictions aren’t reasonable enough to be usable.
That takes too much time, so I’d rather have a solution that doesn’t require more fine-tuning.
A way to upgrade TRT without upgrading JetPack, while keeping my original model without further training, would be ideal, both for me and for other people who might hit this issue.
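One direction that might avoid retraining, offered as an untested assumption: for an integer scale factor (2x here, 128 -> 256 and 12 -> 24), nearest-neighbor resizing with asymmetric coordinates and floor rounding gives the same result as repeating every element along each resized axis. The sketch below checks that equivalence in pure Python with hypothetical helper names. Whether a repeat-style formulation in the original model exports without a Resize node and runs correctly on TRT 7.1 has not been verified.

```python
def resize_nearest_floor_2d(image, out_h, out_w):
    """Reference: nearest-neighbor resize with source index floor(out_idx / scale)."""
    in_h, in_w = len(image), len(image[0])
    return [
        [image[(y * in_h) // out_h][(x * in_w) // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]


def repeat_upsample_2d(image, factor):
    """Upsample by repeating each element `factor` times along both axes."""
    out = []
    for row in image:
        # Repeat every value along the width...
        widened = [value for value in row for _ in range(factor)]
        # ...then repeat the widened row along the height.
        out.extend(list(widened) for _ in range(factor))
    return out


image = [[1, 2, 3], [4, 5, 6]]
upsampled = repeat_upsample_2d(image, 2)
# Both formulations agree for an integer scale factor.
assert upsampled == resize_nearest_floor_2d(image, 4, 6)
for row in upsampled:
    print(row)
```

Because repetition involves no interpolation and no learned parameters, the trained weights would stay valid if such a replacement turns out to be possible.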

samuel17 · June 23, 2023, 6:26pm

@weissrael did you find a solution for this? I am currently facing something similar
