Description
There is an accuracy drop when converting a simple Resize op from ONNX to TensorRT (TRT).
I noticed a general drop in model accuracy in a few of my models and used Polygraphy to debug it.
I traced the root cause to a Resize node and reproduced the problem in a simple example (see the relevant files and steps to reproduce in the appropriate sections).
The Resize op is defined in the ONNX model as follows:
==== ONNX Model ====
Name: tf2onnx | ONNX Opset: 11 | Other Opsets: {'com.microsoft.nchwc': 1, 'ai.onnx.ml': 2, 'com.microsoft.mlfeaturizers': 1, 'com.microsoft': 1, 'ai.onnx.training': 1, 'ai.onnx.preview.training': 1}
---- Docstring ----
converted from /home/jovyan/greeneye/models/detection/test_resize_model/resize_dummy/saved_model
---- 1 Graph Input(s) ----
{input_1 [dtype=float32, shape=(1, 128, 12, 90)]}
---- 1 Graph Output(s) ----
{tf_op_layer_ResizeNearestNeighbor [dtype=float32, shape=(1, 256, 24, 90)]}
---- 2 Initializer(s) ----
{roi__6 [dtype=float32, shape=(0,)],
Concat__16:0 [dtype=int64, shape=(4,)]}
---- 3 Node(s) ----
Node 0 | Transpose__10 [Op: Transpose]
{input_1 [dtype=float32, shape=(1, 128, 12, 90)]}
-> {Transpose__10:0}
---- Attributes ----
Transpose__10.perm = [0, 3, 1, 2]
Node 1 | Resize__17 [Op: Resize]
{Transpose__10:0,
Initializer | roi__6 [dtype=float32, shape=(0,)],
Initializer | roi__6 [dtype=float32, shape=(0,)],
Initializer | Concat__16:0 [dtype=int64, shape=(4,)]}
-> {Resize__17:0}
---- Attributes ----
Resize__17.extrapolation_value = 0.0
Resize__17.cubic_coeff_a = -0.75
Resize__17.coordinate_transformation_mode = asymmetric
Resize__17.exclude_outside = 0
Resize__17.mode = nearest
Resize__17.nearest_mode = floor
Node 2 | PartitionedCall/functional_1/tf_op_layer_ResizeNearestNeighbor/ResizeNearestNeighbor [Op: Transpose]
{Resize__17:0}
-> {tf_op_layer_ResizeNearestNeighbor [dtype=float32, shape=(1, 256, 24, 90)]}
---- Attributes ----
PartitionedCall/functional_1/tf_op_layer_ResizeNearestNeighbor/ResizeNearestNeighbor.perm = [0, 2, 3, 1]
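For reference, this is what I expect the Resize node above to compute, given mode = nearest, coordinate_transformation_mode = asymmetric and nearest_mode = floor. It is a minimal NumPy sketch based on my reading of the ONNX Resize operator definition, not code taken from ONNX Runtime or TensorRT; the helper names `nearest_src_index` and `resize_nearest_nchw` are made up for illustration.

```python
import numpy as np

def nearest_src_index(out_size, in_size):
    """Source index for every output index along one axis.

    asymmetric transform: x_original = x_resized / scale
    nearest_mode=floor:   take floor(x_original), clamped to the valid range
    """
    out_idx = np.arange(out_size, dtype=np.float64)
    src = out_idx * in_size / out_size  # same as out_idx / scale
    return np.clip(np.floor(src), 0, in_size - 1).astype(np.int64)

def resize_nearest_nchw(x, out_h, out_w):
    """Nearest-neighbor resize of an NCHW array over its H and W axes."""
    rows = nearest_src_index(out_h, x.shape[2])
    cols = nearest_src_index(out_w, x.shape[3])
    # Broadcasting the two index vectors picks x[n, c, rows[i], cols[j]].
    return x[:, :, rows[:, None], cols[None, :]]

# Same shapes as the Resize node in this model: (1, 90, 128, 12) -> (1, 90, 256, 24)
rng = np.random.default_rng(0)
x = rng.random((1, 90, 128, 12), dtype=np.float32)
y = resize_nearest_nchw(x, 256, 24)
print(y.shape)  # (1, 90, 256, 24)
```

With these attributes and a scale of 2, every input element should simply appear as a 2x2 block in the output.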
When I run
polygraphy run model.onnx --workspace 1G --atol 1e-1 --rtol 1e-1 --trt --onnxrt --trt-outputs mark all --onnx-outputs mark all
I get the following accuracy failure:
[W] --workspace is deprecated and will be removed in Polygraphy 0.45.0. Use --pool-limit workspace:1G instead.
[I] RUNNING | Command: /usr/local/bin/polygraphy run /green/temp/resize_dummy_explicit_dims_batch_1.onnx --workspace 1G --atol 1e-1 --rtol 1e-1 --trt --onnxrt --trt-outputs mark all --onnx-outputs mark all
[I] trt-runner-N0-02/16/23-23:27:03 | Activating and starting inference
[TensorRT] WARNING: onnx2trt_utils.cpp:220: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[I] Configuring with profiles: [Profile().add('input_1', min=[1, 128, 12, 90], opt=[1, 128, 12, 90], max=[1, 128, 12, 90])]
[W] It looks like some layers in the network have compute precision set, but precision constraints were not enabled.
Precision constraints must be set to 'prefer' or 'obey' for layer compute precision to take effect.
Note: Layers and their requested precisions were: {'(Unnamed Layer* 2) [Constant]': 'INT32'}
[I] Building engine with configuration:
Flags | []
DLA | Default Device Type: DeviceType.GPU, Core: 0
Profiling Verbosity | ProfilingVerbosity.VERBOSE
[I] Finished engine building in 1.738 seconds
[I] trt-runner-N0-02/16/23-23:27:03
---- Inference Input(s) ----
{input_1 [dtype=float32, shape=(1, 128, 12, 90)]}
[I] trt-runner-N0-02/16/23-23:27:03
---- Inference Output(s) ----
{Transpose__10:0 [dtype=float32, shape=(1, 90, 128, 12)],
Resize__17:0 [dtype=float32, shape=(1, 90, 256, 24)],
tf_op_layer_ResizeNearestNeighbor [dtype=float32, shape=(1, 256, 24, 90)]}
[I] trt-runner-N0-02/16/23-23:27:03 | Completed 1 iteration(s) in 8.731 ms | Average inference time: 8.731 ms.
[I] onnxrt-runner-N0-02/16/23-23:27:03 | Activating and starting inference
[I] Loading model: /green/temp/resize_dummy_explicit_dims_batch_1.onnx
[I] Creating ONNX-Runtime Inference Session with providers: ['CPUExecutionProvider']
[I] onnxrt-runner-N0-02/16/23-23:27:03
---- Inference Input(s) ----
{input_1 [dtype=float32, shape=(1, 128, 12, 90)]}
[I] onnxrt-runner-N0-02/16/23-23:27:03
---- Inference Output(s) ----
{Transpose__10:0 [dtype=float32, shape=(1, 90, 128, 12)],
Resize__17:0 [dtype=float32, shape=(1, 90, 256, 24)],
tf_op_layer_ResizeNearestNeighbor [dtype=float32, shape=(1, 256, 24, 90)]}
[I] onnxrt-runner-N0-02/16/23-23:27:03 | Completed 1 iteration(s) in 12.48 ms | Average inference time: 12.48 ms.
[I] Accuracy Comparison | trt-runner-N0-02/16/23-23:27:03 vs. onnxrt-runner-N0-02/16/23-23:27:03
[I] Comparing Output: 'Transpose__10:0' (dtype=float32, shape=(1, 90, 128, 12)) with 'Transpose__10:0' (dtype=float32, shape=(1, 90, 128, 12))
[I] Tolerance: [abs=0.1, rel=0.1] | Checking elemwise error
[I] trt-runner-N0-02/16/23-23:27:03: Transpose__10:0 | Stats: mean=0.49967, std-dev=0.28929, var=0.083687, median=0.50057, min=1.0369e-05 at (0, 58, 43, 9), max=0.99999 at (0, 15, 35, 11), avg-magnitude=0.49967
[I] onnxrt-runner-N0-02/16/23-23:27:03: Transpose__10:0 | Stats: mean=0.49967, std-dev=0.28929, var=0.083687, median=0.50057, min=1.0369e-05 at (0, 58, 43, 9), max=0.99999 at (0, 15, 35, 11), avg-magnitude=0.49967
[I] Error Metrics: Transpose__10:0
[I] Minimum Required Tolerance: elemwise error | [abs=0] OR [rel=0] (requirements may be lower if both abs/rel tolerances are set)
[I] Absolute Difference | Stats: mean=0, std-dev=0, var=0, median=0, min=0 at (0, 0, 0, 0), max=0 at (0, 0, 0, 0), avg-magnitude=0
[I] Relative Difference | Stats: mean=0, std-dev=0, var=0, median=0, min=0 at (0, 0, 0, 0), max=0 at (0, 0, 0, 0), avg-magnitude=0
[I] PASSED | Output: 'Transpose__10:0' | Difference is within tolerance (rel=0.1, abs=0.1)
[I] Comparing Output: 'Resize__17:0' (dtype=float32, shape=(1, 90, 256, 24)) with 'Resize__17:0' (dtype=float32, shape=(1, 90, 256, 24))
[I] Tolerance: [abs=0.1, rel=0.1] | Checking elemwise error
[I] trt-runner-N0-02/16/23-23:27:03: Resize__17:0 | Stats: mean=0.49975, std-dev=0.28933, var=0.083713, median=0.50058, min=1.0369e-05 at (0, 58, 87, 19), max=0.99999 at (0, 15, 71, 23), avg-magnitude=0.49975
[I] ---- Histogram ----
Bin Range | Num Elems | Visualization
(1.04e-05, 0.1) | 56111 | ########################################
(0.1 , 0.2) | 55311 | #######################################
(0.2 , 0.3) | 55494 | #######################################
(0.3 , 0.4) | 54449 | ######################################
(0.4 , 0.5) | 54803 | #######################################
(0.5 , 0.6) | 55559 | #######################################
(0.6 , 0.7) | 55078 | #######################################
(0.7 , 0.8) | 55530 | #######################################
(0.8 , 0.9) | 55028 | #######################################
(0.9 , 1 ) | 55597 | #######################################
[I] onnxrt-runner-N0-02/16/23-23:27:03: Resize__17:0 | Stats: mean=0.49967, std-dev=0.28929, var=0.083687, median=0.50057, min=1.0369e-05 at (0, 58, 86, 18), max=0.99999 at (0, 15, 70, 22), avg-magnitude=0.49967
[I] ---- Histogram ----
Bin Range | Num Elems | Visualization
(1.04e-05, 0.1) | 56056 | ########################################
(0.1 , 0.2) | 55292 | #######################################
(0.2 , 0.3) | 55628 | #######################################
(0.3 , 0.4) | 54380 | ######################################
(0.4 , 0.5) | 54828 | #######################################
(0.5 , 0.6) | 55616 | #######################################
(0.6 , 0.7) | 55144 | #######################################
(0.7 , 0.8) | 55504 | #######################################
(0.8 , 0.9) | 54932 | #######################################
(0.9 , 1 ) | 55580 | #######################################
[I] Error Metrics: Resize__17:0
[I] Minimum Required Tolerance: elemwise error | [abs=0.9989] OR [rel=80209] (requirements may be lower if both abs/rel tolerances are set)
[I] Absolute Difference | Stats: mean=0.24374, std-dev=0.25036, var=0.06268, median=0.17254, min=0 at (0, 0, 0, 0), max=0.9989 at (0, 38, 129, 4), avg-magnitude=0.24374
[I] ---- Histogram ----
Bin Range | Num Elems | Visualization
(0 , 0.0999) | 225868 | ########################################
(0.0999, 0.2 ) | 68455 | ############
(0.2 , 0.3 ) | 60353 | ##########
(0.3 , 0.4 ) | 52322 | #########
(0.4 , 0.499 ) | 44280 | #######
(0.499 , 0.599 ) | 36490 | ######
(0.599 , 0.699 ) | 28309 | #####
(0.699 , 0.799 ) | 20511 | ###
(0.799 , 0.899 ) | 12273 | ##
(0.899 , 0.999 ) | 4099 |
[I] Relative Difference | Stats: mean=4.3466, std-dev=211.49, var=44727, median=0.36421, min=0 at (0, 0, 0, 0), max=80209 at (0, 79, 89, 16), avg-magnitude=4.3466
[I] ---- Histogram ----
Bin Range | Num Elems | Visualization
(0 , 8.02e+03) | 552930 | ########################################
(8.02e+03, 1.6e+04 ) | 15 |
(1.6e+04 , 2.41e+04) | 6 |
(2.41e+04, 3.21e+04) | 2 |
(3.21e+04, 4.01e+04) | 1 |
(4.01e+04, 4.81e+04) | 4 |
(4.81e+04, 5.61e+04) | 1 |
(5.61e+04, 6.42e+04) | 0 |
(6.42e+04, 7.22e+04) | 0 |
(7.22e+04, 8.02e+04) | 1 |
[E] FAILED | Output: 'Resize__17:0' | Difference exceeds tolerance (rel=0.1, abs=0.1)
[I] Comparing Output: 'tf_op_layer_ResizeNearestNeighbor' (dtype=float32, shape=(1, 256, 24, 90)) with 'tf_op_layer_ResizeNearestNeighbor' (dtype=float32, shape=(1, 256, 24, 90))
[I] Tolerance: [abs=0.1, rel=0.1] | Checking elemwise error
[I] trt-runner-N0-02/16/23-23:27:03: tf_op_layer_ResizeNearestNeighbor | Stats: mean=0.49975, std-dev=0.28933, var=0.083713, median=0.50058, min=1.0369e-05 at (0, 87, 19, 58), max=0.99999 at (0, 71, 23, 15), avg-magnitude=0.49975
[I] ---- Histogram ----
Bin Range | Num Elems | Visualization
(1.04e-05, 0.1) | 56111 | ########################################
(0.1 , 0.2) | 55311 | #######################################
(0.2 , 0.3) | 55494 | #######################################
(0.3 , 0.4) | 54449 | ######################################
(0.4 , 0.5) | 54803 | #######################################
(0.5 , 0.6) | 55559 | #######################################
(0.6 , 0.7) | 55078 | #######################################
(0.7 , 0.8) | 55530 | #######################################
(0.8 , 0.9) | 55028 | #######################################
(0.9 , 1 ) | 55597 | #######################################
[I] onnxrt-runner-N0-02/16/23-23:27:03: tf_op_layer_ResizeNearestNeighbor | Stats: mean=0.49967, std-dev=0.28929, var=0.083687, median=0.50057, min=1.0369e-05 at (0, 86, 18, 58), max=0.99999 at (0, 70, 22, 15), avg-magnitude=0.49967
[I] ---- Histogram ----
Bin Range | Num Elems | Visualization
(1.04e-05, 0.1) | 56056 | ########################################
(0.1 , 0.2) | 55292 | #######################################
(0.2 , 0.3) | 55628 | #######################################
(0.3 , 0.4) | 54380 | ######################################
(0.4 , 0.5) | 54828 | #######################################
(0.5 , 0.6) | 55616 | #######################################
(0.6 , 0.7) | 55144 | #######################################
(0.7 , 0.8) | 55504 | #######################################
(0.8 , 0.9) | 54932 | #######################################
(0.9 , 1 ) | 55580 | #######################################
[I] Error Metrics: tf_op_layer_ResizeNearestNeighbor
[I] Minimum Required Tolerance: elemwise error | [abs=0.9989] OR [rel=80209] (requirements may be lower if both abs/rel tolerances are set)
[I] Absolute Difference | Stats: mean=0.24374, std-dev=0.25036, var=0.06268, median=0.17254, min=0 at (0, 0, 0, 0), max=0.9989 at (0, 129, 4, 38), avg-magnitude=0.24374
[I] ---- Histogram ----
Bin Range | Num Elems | Visualization
(0 , 0.0999) | 225868 | ########################################
(0.0999, 0.2 ) | 68455 | ############
(0.2 , 0.3 ) | 60353 | ##########
(0.3 , 0.4 ) | 52322 | #########
(0.4 , 0.499 ) | 44280 | #######
(0.499 , 0.599 ) | 36490 | ######
(0.599 , 0.699 ) | 28309 | #####
(0.699 , 0.799 ) | 20511 | ###
(0.799 , 0.899 ) | 12273 | ##
(0.899 , 0.999 ) | 4099 |
[I] Relative Difference | Stats: mean=4.3466, std-dev=211.49, var=44727, median=0.36421, min=0 at (0, 0, 0, 0), max=80209 at (0, 89, 16, 79), avg-magnitude=4.3466
[I] ---- Histogram ----
Bin Range | Num Elems | Visualization
(0 , 8.02e+03) | 552930 | ########################################
(8.02e+03, 1.6e+04 ) | 15 |
(1.6e+04 , 2.41e+04) | 6 |
(2.41e+04, 3.21e+04) | 2 |
(3.21e+04, 4.01e+04) | 1 |
(4.01e+04, 4.81e+04) | 4 |
(4.81e+04, 5.61e+04) | 1 |
(5.61e+04, 6.42e+04) | 0 |
(6.42e+04, 7.22e+04) | 0 |
(7.22e+04, 8.02e+04) | 1 |
[E] FAILED | Output: 'tf_op_layer_ResizeNearestNeighbor' | Difference exceeds tolerance (rel=0.1, abs=0.1)
[E] FAILED | Mismatched outputs: ['Resize__17:0', 'tf_op_layer_ResizeNearestNeighbor']
[E] Accuracy Summary | trt-runner-N0-02/16/23-23:27:03 vs. onnxrt-runner-N0-02/16/23-23:27:03 | Passed: 0/1 iterations | Pass Rate: 0.0%
[E] FAILED | Runtime: 4.653s | Command: /usr/local/bin/polygraphy run /green/temp/resize_dummy_explicit_dims_batch_1.onnx --workspace 1G --atol 1e-1 --rtol 1e-1 --trt --onnxrt --trt-outputs mark all --onnx-outputs mark all
Also note that the comparison fails even though I set a very loose tolerance of 1e-1 (--atol 1e-1 --rtol 1e-1); the default is 1e-5.
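To give a sense of scale for the reported differences, here is a rough sketch. It assumes the test input is roughly uniform in [0, 1), which the min/max values and histograms in the log suggest but which I have not verified. It estimates the mean absolute difference between two unrelated uniform samples.

```python
import numpy as np

# Two independent sets of uniform values in [0, 1); the seed is arbitrary.
rng = np.random.default_rng(0)
a = rng.random(1_000_000)
b = rng.random(1_000_000)

# Mean absolute difference between unrelated uniform values (theory: 1/3).
mean_abs_diff = float(np.mean(np.abs(a - b)))
print(f"{mean_abs_diff:.2f}")  # 0.33
```

The mean absolute difference of 0.24 reported for Resize__17:0 is of the same order, which looks more like many output elements being read from a different input element than like floating-point noise.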
I believe the issue is in how the TRT op is built from the attributes of the ONNX op, so that TRT performs a different mathematical operation and produces a different output.
Since the actual implementation of the TRT op isn't available, I hope you can find the bug and fix it.
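One way to probe this hypothesis using only the numbers in the log: the minimum of Transpose__10:0 sits at (0, 58, 43, 9), ONNX Runtime reports the minimum of Resize__17:0 at (0, 58, 86, 18), and TensorRT reports it at (0, 58, 87, 19). The sketch below checks which output index first maps back to source row 43 and column 9 under two coordinate transforms from the ONNX Resize definition, both with nearest_mode = floor. Treat it as a guess: half_pixel is only a candidate I picked for comparison, other mappings may produce the same shift, and I am assuming the reported position is the first occurrence of the minimum. The helper names are hypothetical.

```python
import math

def src_index(out_idx, scale, coord_mode):
    """Source index for nearest_mode=floor under a given coordinate transform."""
    if coord_mode == "asymmetric":
        src = out_idx / scale
    elif coord_mode == "half_pixel":
        src = (out_idx + 0.5) / scale - 0.5
    else:
        raise ValueError(f"unsupported mode: {coord_mode}")
    return max(math.floor(src), 0)  # clamp negative indices to 0

def first_output_index(src_target, out_size, scale, coord_mode):
    """First output index that reads from src_target, or None if there is none."""
    for i in range(out_size):
        if src_index(i, scale, coord_mode) == src_target:
            return i
    return None

# Source position of the minimum, taken from the log (H index 43, W index 9).
SRC_ROW, SRC_COL = 43, 9
results = {}
for mode in ("asymmetric", "half_pixel"):
    row = first_output_index(SRC_ROW, 256, 2.0, mode)
    col = first_output_index(SRC_COL, 24, 2.0, mode)
    results[mode] = (row, col)
    print(mode, (row, col))
# asymmetric (86, 18)  -> the position ONNX Runtime reports
# half_pixel (87, 19)  -> the position TensorRT reports
```

This does not show what TensorRT 7.1.3 does internally. It only shows that the one-element shift in the log is consistent with the asymmetric coordinate transform not being applied as specified.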
I would also appreciate a workaround until it is solved.
Thanks in advance
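Regarding a possible workaround: for an integer scale factor, nearest-neighbor resizing with asymmetric coordinates and floor rounding is the same as repeating each element along H and W, so the Resize node might be replaceable by an op that only repeats elements. The NumPy sketch below checks only that equivalence for the 2x case in this model. I have not tested whether an equivalent graph rewrite converts correctly on TensorRT 7.1.3, and the function names are made up for illustration.

```python
import numpy as np

def upsample2x_floor(x):
    """2x nearest resize (asymmetric, floor) of an NCHW array via index lookup."""
    _, _, h, w = x.shape
    rows = np.arange(2 * h) // 2  # floor(i / 2)
    cols = np.arange(2 * w) // 2
    return x[:, :, rows[:, None], cols[None, :]]

def upsample2x_repeat(x):
    """Same result expressed as plain element repetition along H and W."""
    return np.repeat(np.repeat(x, 2, axis=2), 2, axis=3)

# Same input shape as the Resize node in this model; the seed is arbitrary.
rng = np.random.default_rng(0)
x = rng.random((1, 90, 128, 12), dtype=np.float32)

same = np.array_equal(upsample2x_floor(x), upsample2x_repeat(x))
print(same)  # True
```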
Environment
TensorRT Version : 7.1.3
NVIDIA GPU : Jetson AGX Xavier
NVIDIA Driver Version : l4t 32.5.1
CUDA Version : 10.2
CUDNN Version :
ONNX runtime : onnxruntime 1.10.0
ONNX opset : 11
Operating System : Ubuntu 18.04.5 LTS
Python Version (if applicable) : 3.6.9
Tensorflow Version (if applicable) : 2.3 (original model was created with TF, then converted to ONNX)
PyTorch Version (if applicable) :
Baremetal or Container (if so, version) : custom container
Relevant files to reproduce
Link to an ONNX model that shows the issue when converted to TRT