ONNX output differs largely from TRT engine output

Description

I have an ONNX model whose output has been verified to be almost identical to that of my original PyTorch model.

After I convert it to a TensorRT engine, the output changes too much. Is there any tool I can use to debug and locate where the error between ONNX and TRT is introduced?

Environment

nvidia docker container 22.12

Relevant Files

Here is the onnx model: https://cloud.tsinghua.edu.cn/f/8e1a7623952946c7bb76/?dl=1

Steps To Reproduce

Use this script to reproduce:

import os
import torch
import torch.nn as nn
import tensorrt as trt

TRT_LOGGER = trt.Logger()
trt.init_libnvinfer_plugins(TRT_LOGGER, '')

def load_engine(engine_file_path):
    assert os.path.exists(engine_file_path)
    print("Reading engine from file {}".format(engine_file_path))
    with open(engine_file_path, "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
        return runtime.deserialize_cuda_engine(f.read())

from torch.testing._internal.common_utils import numpy_to_torch_dtype_dict
def get_trt_stuff(engine_path):
    engine = load_engine(engine_path)
    context = engine.create_execution_context()
    inputs_dict = {}
    outputs_dict = {}
    bindings = []
    # Allocate a GPU tensor for every engine binding and record its device pointer.
    for binding in engine:
        binding_idx = engine.get_binding_index(binding)
        # size = trt.volume(context.get_binding_shape(binding_idx))
        dtype = trt.nptype(engine.get_binding_dtype(binding))
        shape = tuple(context.get_binding_shape(binding_idx))
        if engine.binding_is_input(binding):
            inputs_dict[binding] = torch.empty(*shape, dtype=numpy_to_torch_dtype_dict[dtype], device='cuda')
            bindings.append(int(inputs_dict[binding].data_ptr()))
        else:
            outputs_dict[binding] = torch.empty(*shape, dtype=numpy_to_torch_dtype_dict[dtype], device='cuda')
            bindings.append(int(outputs_dict[binding].data_ptr()))
    return context, bindings, inputs_dict, outputs_dict

def run_trt(context, bindings, stream=None):
    if stream is None:
        stream = torch.cuda.default_stream()
    state = context.execute_async_v2(bindings=bindings, stream_handle=stream.cuda_stream)
    stream.synchronize()
    return state

class TRTModule(nn.Module):
    def __init__(self, engine_path):
        super().__init__()
        self.context, self.bindings, self.inputs_dict, self.outputs_dict = get_trt_stuff(engine_path)
    def forward(self, *inputs, **kw_args):
        # Copy positional and keyword args into the engine's input_0, input_1, ... bindings in order.
        device = 'cpu'
        for i, inp in enumerate(inputs):
            self.inputs_dict['input_{}'.format(i)].copy_(inp)
            device = inp.device
        shift = len(inputs)
        for k in kw_args:
            self.inputs_dict['input_{}'.format(shift)].copy_(kw_args[k])
            shift += 1
        state = run_trt(self.context, self.bindings)
        if not state:
            raise Exception("trt engine execution failed")
        outputs = []
        for i in range(len(self.outputs_dict)):
            outputs.append(self.outputs_dict['output_{}'.format(i)].cpu().to(device))
        if len(outputs) == 1:
            outputs = outputs[0]
        return outputs

import onnxruntime as ort

def get_ort_stuff(onnx_path, providers):
    return ort.InferenceSession(onnx_path, providers=providers)

class ORTModule(nn.Module):
    def __init__(self, onnx_path, providers=['CUDAExecutionProvider', 'CPUExecutionProvider']):
        super().__init__()
        self.sess = get_ort_stuff(onnx_path, providers)
    def forward(self, *inputs, **kw_args):
        device = 'cpu'
        for inp in inputs:
            device = inp.device
        for k in kw_args:
            device = kw_args[k].device
        inputs_dict = {'input_{}'.format(i):x.cpu().numpy() if isinstance(x, torch.Tensor) else x for i, x in enumerate(inputs)}
        shift = len(inputs_dict)
        for k in kw_args:
            inputs_dict['input_{}'.format(shift)] = kw_args[k].cpu().numpy()
            shift += 1
        outputs = self.sess.run(None, inputs_dict)
        outputs = [torch.from_numpy(x).to(device) for x in outputs]
        if len(outputs) == 1:
            outputs = outputs[0]
        return outputs

input_0 = torch.randn(2, 3, 256, 256, dtype=torch.float32).cuda()
input_1 = torch.tensor([1, 3], dtype=torch.int32).cuda()
input_2 = torch.randn(2, 81, 640, dtype=torch.float32).cuda()

ort_model = ORTModule('onnx/EfficientUNetModel_.onnx', providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])
os.system('trtexec --onnx=onnx/EfficientUNetModel_.onnx --saveEngine=onnx/EfficientUNetModel_.trt --fp16 --buildOnly')
trt_model = TRTModule('onnx/EfficientUNetModel_.trt')
out_ort = ort_model(input_0, input_1, input_2)
out_trt = trt_model(input_0, input_1, input_2)
print((out_ort - out_trt).abs().max().item())

Before running it, you should pip install onnxruntime-gpu.

It shows that the maximum absolute error is as large as ~4.6.

Also refer to Onnx output differs largely to TRT engine output

We have a tool called Polygraphy that can be used to debug accuracy issues; see TensorRT/tools/Polygraphy/examples/cli/run/01_comparing_frameworks at main · NVIDIA/TensorRT · GitHub
For your case, you can first run with FP32 precision and see whether the accuracy issue only happens with FP16. If only FP16 fails, it is usually caused by FP16 overflow, e.g. the output of some accumulation operation exceeding 65504. This can be fixed by forcing FP32 precision for those problematic layers/tensors.
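
In case the Python API is easier to adapt than the CLI, here is a minimal sketch based on Polygraphy's comparing-frameworks example (the model path is the one used above; adjust as needed):

from polygraphy.backend.onnxrt import OnnxrtRunner, SessionFromOnnx
from polygraphy.backend.trt import CreateConfig, EngineFromNetwork, NetworkFromOnnxPath, TrtRunner
from polygraphy.comparator import Comparator

model_path = 'onnx/EfficientUNetModel_.onnx'

runners = [
    # Set fp16=False first to confirm that a pure FP32 engine matches ONNX Runtime.
    TrtRunner(EngineFromNetwork(NetworkFromOnnxPath(model_path), config=CreateConfig(fp16=True))),
    OnnxrtRunner(SessionFromOnnx(model_path)),
]

# Run both runners on the same synthetic inputs and compare all outputs.
run_results = Comparator.run(runners)
assert bool(Comparator.compare_accuracy(run_results))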

I have checked that there's no FP16 overflow… So is there any other possible reason for the output error?

This is the simplified problem for now:

I have a very simple ONNX file (where I have located the problematic sub-network): https://cloud.tsinghua.edu.cn/f/5db9c79dc5a841ada575/?dl=1

When I test it with polygraphy run onnx/sample.onnx --trt --fp16 --onnxrt, the resulting engine produces a very large error.

How can I fix this problem?

Can you share the FP32/FP16 error report from Polygraphy?
You can also use "--trt-outputs mark all --onnx-outputs mark all" to dump per-layer accuracy results; this can help you root-cause the problematic layer and understand the error propagation.

Thank you for your suggestion. As the per-layer Polygraphy comparison shows, InstanceNormalization contributes most of the error:

[I]     Comparing Output: '/input_blocks.1/input_blocks.1.0/in_layers.0/InstanceNormalization_output_0' (dtype=float32, shape=(2, 32, 262144)) with '/input_blocks.1/input_blocks.1.0/in_layers.0/InstanceNormalization_output_0' (dtype=float16, shape=(2, 32, 262144))
[I]         Tolerance: [abs=1e-05, rel=1e-05] | Checking elemwise error                                          
[I]         trt-runner-N0-02/22/23-06:48:38: /input_blocks.1/input_blocks.1.0/in_layers.0/InstanceNormalization_output_0 | Stats: mean=-1.7115e-09, std-dev=0.005532, var=3.0603e-05, median=0, min=-0.042147 at (1, 13, 123060), max=0.045272 at (0, 11, 47871), avg-magnitude=0.0031986
[I]             ---- Histogram ----                                                                              
                Bin Range        |  Num Elems | Visualization                                                    
                (-5.38 , -4.27 ) |          0 |                                                                  
                (-4.27 , -3.15 ) |          0 |                                                                  
                (-3.15 , -2.03 ) |          0 |                                                                  
                (-2.03 , -0.916) |          0 |                                                                  
                (-0.916, 0.201 ) |   16777216 | ########################################                         
                (0.201 , 1.32  ) |          0 |                                                                  
                (1.32  , 2.43  ) |          0 |                                                                  
                (2.43  , 3.55  ) |          0 |                                                                  
                (3.55  , 4.67  ) |          0 |
                (4.67  , 5.79  ) |          0 |
[I]         onnxrt-runner-N0-02/22/23-06:48:38: /input_blocks.1/input_blocks.1.0/in_layers.0/InstanceNormalization_output_0 | Stats: mean=2.0576e-07, std-dev=0.99961, var=0.99923, median=0.053864, min=-5.3828 at (1, 13, 123060), max=5.7852 at (0, 11, 47871), avg-magnitude=0.82224
[I]             ---- Histogram ----
                Bin Range        |  Num Elems | Visualization
                (-5.38 , -4.27 ) |         62 |
                (-4.27 , -3.15 ) |       4321 |
                (-3.15 , -2.03 ) |     246204 | #
                (-2.03 , -0.916) |    3273482 | ######################
                (-0.916, 0.201 ) |    5860698 | ########################################
                (0.201 , 1.32  ) |    5725682 | #######################################
                (1.32  , 2.44  ) |    1631536 | ###########
                (2.44  , 3.55  ) |      34636 |
                (3.55  , 4.67  ) |        578 |
                (4.67  , 5.79  ) |         17 |
[I]         Error Metrics: /input_blocks.1/input_blocks.1.0/in_layers.0/InstanceNormalization_output_0           
[I]             Minimum Required Tolerance: elemwise error | [abs=5.7399] OR [rel=1.0935] (requirements may be lower if both abs/rel tolerances are set)
[I]             Absolute Difference | Stats: mean=0.81905, std-dev=0.56622, var=0.3206, median=0.74607, min=3.4004e-07 at (0, 1, 150973), max=5.7399 at (0, 11, 47871), avg-magnitude=0.81905
[I]                 ---- Histogram ----
                    Bin Range        |  Num Elems | Visualization
                    (3.4e-07, 0.574) |    6758688 | ########################################
                    (0.574  , 1.15 ) |    5077625 | ##############################
                    (1.15   , 1.72 ) |    3851361 | ######################
                    (1.72   , 2.3  ) |     936867 | #####
                    (2.3    , 2.87 ) |     134072 |
                    (2.87   , 3.44 ) |      16417 |
                    (3.44   , 4.02 ) |       1931 |
                    (4.02   , 4.59 ) |        222 |
                    (4.59   , 5.17 ) |         29 |
                    (5.17   , 5.74 ) |          4 |
[I]             Relative Difference | Stats: mean=0.99609, std-dev=0.003915, var=1.5327e-05, median=0.99998, min=0.84108 at (1, 7, 46214), max=1.0935 at (1, 7, 112736), avg-magnitude=0.99609
[I]                 ---- Histogram ----
                    Bin Range      |  Num Elems | Visualization
                    (0.841, 0.866) |          2 |
                    (0.866, 0.892) |          2 |
                    (0.892, 0.917) |          2 |
                    (0.917, 0.942) |          5 |
                    (0.942, 0.967) |         12 |
                    (0.967, 0.993) |    8386932 | #######################################
                    (0.993, 1.02 ) |    8390241 | ########################################
                    (1.02 , 1.04 ) |         13 |
                    (1.04 , 1.07 ) |          3 |
                    (1.07 , 1.09 ) |          4 |
[E]         FAILED | Output: '/input_blocks.1/input_blocks.1.0/in_layers.0/InstanceNormalization_output_0' | Difference exceeds tolerance (rel=1e-05, abs=1e-05)

Is there a bug in TensorRT's InstanceNormalization layer? My InstanceNorm layer is actually a 32-group GroupNorm.

This is a common case: InstanceNorm causes accuracy issues in various models. The key problem is that, when using FP16 precision, the InstanceNorm kernel needs to accumulate the per-channel/batch elements in FP16. Although the final result of InstanceNorm does not overflow, the accumulator can overflow before the division operation is executed. That causes the accuracy issue.
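
As a toy illustration of that failure mode (plain numpy, not TensorRT's actual kernel; 262144 is the per-channel element count from the log above):

import numpy as np

# One instance-norm channel above has 262144 elements. Even with small values,
# a running FP16 sum passes the FP16 maximum (65504) long before the division.
x = np.full(262144, 0.5, dtype=np.float16)
print(np.sum(x, dtype=np.float16))   # inf: the FP16 accumulator overflows
print(np.sum(x, dtype=np.float32))   # 131072.0: fine with an FP32 accumulator
print(np.mean(x, dtype=np.float32))  # 0.5: the normalization statistic itself is small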

The recommended way to fix this issue is to mark the precision of the InstanceNorm layers as FP32. Polygraphy provides this functionality with "--layerPrecision"/"--tensorPrecision"; check the Polygraphy --help output to see how to use it. trtexec should also have similar options.

Thank you for your explanation! However, I have a very large model. Is there any example of using Python to build the TRT engine from the ONNX model? Otherwise, I would have to list all the InstanceNorm layer names on the trtexec command line… (Or do the trtexec command-line flags support regular expressions?)

You can use ONNX GraphSurgeon to modify your ONNX model; it's easy to use. See TensorRT/tools/onnx-graphsurgeon/examples/04_modifying_a_model at main · NVIDIA/TensorRT · GitHub

As for your second question, as far as I know, trtexec does not support regular expressions; we might add that in the future. For now, you can write a script that finds all InstanceNorm layer names and generates the trtexec command.
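
For reference, here is a rough sketch of doing the same thing directly from Python instead of generating a trtexec command (assuming the TensorRT 8.x builder API; the ONNX parser usually keeps the original node names in the layer names, but verify this for your model):

import tensorrt as trt

TRT_LOGGER = trt.Logger()
builder = trt.Builder(TRT_LOGGER)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, TRT_LOGGER)
with open('onnx/sample.onnx', 'rb') as f:
    assert parser.parse(f.read()), parser.get_error(0)

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)
config.set_flag(trt.BuilderFlag.OBEY_PRECISION_CONSTRAINTS)

# Pin every layer whose name mentions InstanceNormalization to FP32.
for i in range(network.num_layers):
    layer = network.get_layer(i)
    if 'InstanceNormalization' in layer.name:
        layer.precision = trt.float32
        for j in range(layer.num_outputs):
            layer.set_output_type(j, trt.float32)

with open('onnx/sample.trt', 'wb') as f:
    f.write(builder.build_serialized_network(network, config))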

I tried the layer-precision flag in Polygraphy. However, I still get a large error when the norm is forced to FP32. Is there anything wrong with my command?

polygraphy run onnx/sample.onnx --trt --fp16 --precision-constraints=obey --layer-precisions=/input_blocks.1/input_blocks.1.0/in_layers.0/InstanceNormalization:float32 --onnxrt --trt-outputs mark all --onnx-outputs mark all
[I]     Comparing Output: '/input_blocks.1/input_blocks.1.0/in_layers.0/InstanceNormalization_output_0' (dtype=float32, shape=(2, 32, 262144)) with '/input_blocks.1/input_blocks.1.0/in_layers.0/InstanceNormalization_output_0' (dtype=float16, shape=(2, 32, 262144))
[I]         Tolerance: [abs=1e-05, rel=1e-05] | Checking elemwise error                                          
[I]         trt-runner-N0-02/22/23-08:38:32: /input_blocks.1/input_blocks.1.0/in_layers.0/InstanceNormalization_output_0 | Stats: mean=-1.8423e-09, std-dev=0.005532, var=3.0603e-05, median=0, min=-0.042147 at (1, 13, 123060), max=0.045272 at (0, 11, 47871), avg-magnitude=0.0031986
[I]             ---- Histogram ----                                                                              
                Bin Range        |  Num Elems | Visualization                                                    
                (-5.38 , -4.27 ) |          0 |                                                                  
                (-4.27 , -3.15 ) |          0 |                                                                  
                (-3.15 , -2.03 ) |          0 |                                                                  
                (-2.03 , -0.916) |          0 |                                                                  
                (-0.916, 0.201 ) |   16777216 | ########################################                         
                (0.201 , 1.32  ) |          0 |                                                                  
                (1.32  , 2.43  ) |          0 |                                                                  
                (2.43  , 3.55  ) |          0 |                                                                  
                (3.55  , 4.67  ) |          0 |                                                                  
                (4.67  , 5.79  ) |          0 |                                                                  
[I]         onnxrt-runner-N0-02/22/23-08:38:32: /input_blocks.1/input_blocks.1.0/in_layers.0/InstanceNormalization_output_0 | Stats: mean=2.0576e-07, std-dev=0.99961, var=0.99923, median=0.053864, min=-5.3828 at (1, 13, 123060), max=5.7852 at (0, 11, 47871), avg-magnitude=0.82224
[I]             ---- Histogram ----                                                                              
                Bin Range        |  Num Elems | Visualization                                                    
                (-5.38 , -4.27 ) |         62 |                                                                  
                (-4.27 , -3.15 ) |       4321 |                                                                  
                (-3.15 , -2.03 ) |     246204 | #                                                                
                (-2.03 , -0.916) |    3273482 | ######################                                           
                (-0.916, 0.201 ) |    5860698 | ########################################                         
                (0.201 , 1.32  ) |    5725682 | #######################################                          
                (1.32  , 2.44  ) |    1631536 | ###########                                                      
                (2.44  , 3.55  ) |      34636 | 
                (3.55  , 4.67  ) |        578 |
                (4.67  , 5.79  ) |         17 |
[I]         Error Metrics: /input_blocks.1/input_blocks.1.0/in_layers.0/InstanceNormalization_output_0
[I]             Minimum Required Tolerance: elemwise error | [abs=5.7399] OR [rel=1.0935] (requirements may be lower if both abs/rel tolerances are set)
[I]             Absolute Difference | Stats: mean=0.81905, std-dev=0.56622, var=0.3206, median=0.74607, min=3.4004e-07 at (0, 1, 150973), max=5.7399 at (0, 11, 47871), avg-magnitude=0.81905
[I]                 ---- Histogram ----
                    Bin Range        |  Num Elems | Visualization
                    (3.4e-07, 0.574) |    6758688 | ########################################
                    (0.574  , 1.15 ) |    5077625 | ##############################
                    (1.15   , 1.72 ) |    3851361 | ######################
                    (1.72   , 2.3  ) |     936867 | #####
                    (2.3    , 2.87 ) |     134072 |
                    (2.87   , 3.44 ) |      16417 |
                    (3.44   , 4.02 ) |       1931 |
                    (4.02   , 4.59 ) |        222 |
                    (4.59   , 5.17 ) |         29 |
                    (5.17   , 5.74 ) |          4 |
[I]             Relative Difference | Stats: mean=0.99609, std-dev=0.003915, var=1.5327e-05, median=1, min=0.84108 at (1, 7, 46214), max=1.0935 at (1, 7, 112736), avg-magnitude=0.99609
[I]                 ---- Histogram ----
                    Bin Range      |  Num Elems | Visualization
                    (0.841, 0.866) |          2 |
                    (0.866, 0.892) |          2 |
                    (0.892, 0.917) |          2 |
                    (0.917, 0.942) |          5 |
                    (0.942, 0.967) |         12 |
                    (0.967, 0.993) |    8386932 | #######################################
                    (0.993, 1.02 ) |    8390241 | ########################################
                    (1.02 , 1.04 ) |         13 |
                    (1.04 , 1.07 ) |          3 |
                    (1.07 , 1.09 ) |          4 |

Can you add the "-v -v -v" options to Polygraphy and share the whole Polygraphy log with me?

Here is the log: https://cloud.tsinghua.edu.cn/f/85eee484ca024503bd31/?dl=1

It seems that this time the error has become even larger…

Yes, this seems like a TRT bug. I can repro it in TRT 8.5, but the issue appears to have been fixed in TRT 8.6. You can wait for the TRT 8.6 release to verify it; it should come out next month.

Alright… Looking forward to the new release!


Moreover, is there any workaround before the new release comes out? I notice there are group normalization plugins in the repo, but they don't have any docs or READMEs. Can I use them, and if so, how?