ONNX output differs largely from TRT engine output

Description

I have an ONNX model whose output has been verified to be almost identical to that of my original PyTorch model.

After I convert it to a TensorRT engine, the output changes too much. Is there any tool I can use to debug and locate where the error between ONNX and TRT is introduced?

Environment

nvidia docker container 22.12

Relevant Files

Here is the onnx model: https://cloud.tsinghua.edu.cn/f/8e1a7623952946c7bb76/?dl=1

Steps To Reproduce

Use this script to reproduce:

import os
import torch
import torch.nn as nn
import tensorrt as trt

TRT_LOGGER = trt.Logger()
trt.init_libnvinfer_plugins(TRT_LOGGER, '')

def load_engine(engine_file_path):
    assert os.path.exists(engine_file_path)
    print("Reading engine from file {}".format(engine_file_path))
    with open(engine_file_path, "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
        return runtime.deserialize_cuda_engine(f.read())

from torch.testing._internal.common_utils import numpy_to_torch_dtype_dict
def get_trt_stuff(engine_path):
    engine = load_engine(engine_path)
    context = engine.create_execution_context()
    inputs_dict = {}
    outputs_dict = {}
    bindings = []
    # Allocate a GPU tensor for every engine binding and record its device pointer.
    for binding in engine:
        binding_idx = engine.get_binding_index(binding)
        # size = trt.volume(context.get_binding_shape(binding_idx))
        dtype = trt.nptype(engine.get_binding_dtype(binding))
        shape = tuple(context.get_binding_shape(binding_idx))
        if engine.binding_is_input(binding):
            inputs_dict[binding] = torch.empty(*shape, dtype=numpy_to_torch_dtype_dict[dtype], device='cuda')
            bindings.append(int(inputs_dict[binding].data_ptr()))
        else:
            outputs_dict[binding] = torch.empty(*shape, dtype=numpy_to_torch_dtype_dict[dtype], device='cuda')
            bindings.append(int(outputs_dict[binding].data_ptr()))
    return context, bindings, inputs_dict, outputs_dict

def run_trt(context, bindings, stream=None):
    if stream is None:
        stream = torch.cuda.default_stream()
    state = context.execute_async_v2(bindings=bindings, stream_handle=stream.cuda_stream)
    stream.synchronize()
    return state

class TRTModule(nn.Module):
    def __init__(self, engine_path):
        super().__init__()
        self.context, self.bindings, self.inputs_dict, self.outputs_dict = get_trt_stuff(engine_path)
    def forward(self, *inputs, **kw_args):
        # Copy positional and keyword args into the engine's input_0, input_1, ... bindings in order.
        device = 'cpu'
        for i, inp in enumerate(inputs):
            self.inputs_dict['input_{}'.format(i)].copy_(inp)
            device = inp.device
        shift = len(inputs)
        for k in kw_args:
            self.inputs_dict['input_{}'.format(shift)].copy_(kw_args[k])
            shift += 1
        state = run_trt(self.context, self.bindings)
        if not state:
            raise Exception("trt engine execution failed")
        outputs = []
        for i in range(len(self.outputs_dict)):
            outputs.append(self.outputs_dict['output_{}'.format(i)].cpu().to(device))
        if len(outputs) == 1:
            outputs = outputs[0]
        return outputs

import onnxruntime as ort

def get_ort_stuff(onnx_path, providers):
    return ort.InferenceSession(onnx_path, providers=providers)

class ORTModule(nn.Module):
    def __init__(self, onnx_path, providers=['CUDAExecutionProvider', 'CPUExecutionProvider']):
        super().__init__()
        self.sess = get_ort_stuff(onnx_path, providers)
    def forward(self, *inputs, **kw_args):
        device = 'cpu'
        for inp in inputs:
            device = inp.device
        for k in kw_args:
            device = kw_args[k].device
        inputs_dict = {'input_{}'.format(i):x.cpu().numpy() if isinstance(x, torch.Tensor) else x for i, x in enumerate(inputs)}
        shift = len(inputs_dict)
        for k in kw_args:
            inputs_dict['input_{}'.format(shift)] = kw_args[k].cpu().numpy()
            shift += 1
        outputs = self.sess.run(None, inputs_dict)
        outputs = [torch.from_numpy(x).to(device) for x in outputs]
        if len(outputs) == 1:
            outputs = outputs[0]
        return outputs

input_0 = torch.randn(2, 3, 256, 256, dtype=torch.float32).cuda()
input_1 = torch.tensor([1, 3], dtype=torch.int32).cuda()
input_2 = torch.randn(2, 81, 640, dtype=torch.float32).cuda()

ort_model = ORTModule('onnx/EfficientUNetModel_.onnx', providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])
os.system('trtexec --onnx=onnx/EfficientUNetModel_.onnx --saveEngine=onnx/EfficientUNetModel_.trt --fp16 --buildOnly')
trt_model = TRTModule('onnx/EfficientUNetModel_.trt')
out_ort = ort_model(input_0, input_1, input_2)
out_trt = trt_model(input_0, input_1, input_2)
print((out_ort - out_trt).abs().max().item())

Before running it, you should pip install onnxruntime-gpu.

It shows that the maximum absolute error is as large as ~4.6.

Also refer to Onnx output differs largely to TRT engine output

We have a tool called Polygraphy that can be used to debug accuracy issues; see TensorRT/tools/Polygraphy/examples/cli/run/01_comparing_frameworks at main · NVIDIA/TensorRT · GitHub
For your case, you can first run with FP32 precision and see whether the accuracy issue only happens with FP16. If only FP16 fails, it is usually caused by FP16 overflow, e.g. the output of some accumulation operation exceeding 65504. This can be fixed by forcing FP32 precision for those problematic layers/tensors.
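
In case the Python API is easier to adapt than the CLI, here is a minimal sketch based on Polygraphy's comparing-frameworks example (the model path is the one used above; adjust as needed):

from polygraphy.backend.onnxrt import OnnxrtRunner, SessionFromOnnx
from polygraphy.backend.trt import CreateConfig, EngineFromNetwork, NetworkFromOnnxPath, TrtRunner
from polygraphy.comparator import Comparator

model_path = 'onnx/EfficientUNetModel_.onnx'

runners = [
    # Set fp16=False first to confirm that a pure FP32 engine matches ONNX Runtime.
    TrtRunner(EngineFromNetwork(NetworkFromOnnxPath(model_path), config=CreateConfig(fp16=True))),
    OnnxrtRunner(SessionFromOnnx(model_path)),
]

# Run both runners on the same synthetic inputs and compare all outputs.
run_results = Comparator.run(runners)
assert bool(Comparator.compare_accuracy(run_results))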

I have checked that there's no FP16 overflow… So is there any other possible reason for the output error?

This is the simplified problem for now:

I have a very simple ONNX file (where I have located the problematic sub-network): https://cloud.tsinghua.edu.cn/f/5db9c79dc5a841ada575/?dl=1

When I test it with polygraphy run onnx/sample.onnx --trt --fp16 --onnxrt, the resulting engine produces a very large error.

How can I fix this problem?

Can you share the FP32/FP16 error report from Polygraphy?
You can also use "--trt-outputs mark all --onnx-outputs mark all" to dump per-layer accuracy results; this can help you root-cause the problematic layer and understand the error propagation.

Thank you for your suggestion. As the per-layer Polygraphy comparison shows, InstanceNormalization contributes most of the error:

[I]     Comparing Output: '/input_blocks.1/input_blocks.1.0/in_layers.0/InstanceNormalization_output_0' (dtype=float32, shape=(2, 32, 262144)) with '/input_blocks.1/input_blocks.1.0/in_layers.0/InstanceNormalization_output_0' (dtype=float16, shape=(2, 32, 262144))
[I]         Tolerance: [abs=1e-05, rel=1e-05] | Checking elemwise error                                          
[I]         trt-runner-N0-02/22/23-06:48:38: /input_blocks.1/input_blocks.1.0/in_layers.0/InstanceNormalization_output_0 | Stats: mean=-1.7115e-09, std-dev=0.005532, var=3.0603e-05, median=0, min=-0.042147 at (1, 13, 123060), max=0.045272 at (0, 11, 47871), avg-magnitude=0.0031986
[I]             ---- Histogram ----                                                                              
                Bin Range        |  Num Elems | Visualization                                                    
                (-5.38 , -4.27 ) |          0 |                                                                  
                (-4.27 , -3.15 ) |          0 |                                                                  
                (-3.15 , -2.03 ) |          0 |                                                                  
                (-2.03 , -0.916) |          0 |                                                                  
                (-0.916, 0.201 ) |   16777216 | ########################################                         
                (0.201 , 1.32  ) |          0 |                                                                  
                (1.32  , 2.43  ) |          0 |                                                                  
                (2.43  , 3.55  ) |          0 |                                                                  
                (3.55  , 4.67  ) |          0 |
                (4.67  , 5.79  ) |          0 |
[I]         onnxrt-runner-N0-02/22/23-06:48:38: /input_blocks.1/input_blocks.1.0/in_layers.0/InstanceNormalization_output_0 | Stats: mean=2.0576e-07, std-dev=0.99961, var=0.99923, median=0.053864, min=-5.3828 at (1, 13, 123060), max=5.7852 at (0, 11, 47871), avg-magnitude=0.82224
[I]             ---- Histogram ----
                Bin Range        |  Num Elems | Visualization
                (-5.38 , -4.27 ) |         62 |
                (-4.27 , -3.15 ) |       4321 |
                (-3.15 , -2.03 ) |     246204 | #
                (-2.03 , -0.916) |    3273482 | ######################
                (-0.916, 0.201 ) |    5860698 | ########################################
                (0.201 , 1.32  ) |    5725682 | #######################################
                (1.32  , 2.44  ) |    1631536 | ###########
                (2.44  , 3.55  ) |      34636 |
                (3.55  , 4.67  ) |        578 |
                (4.67  , 5.79  ) |         17 |
[I]         Error Metrics: /input_blocks.1/input_blocks.1.0/in_layers.0/InstanceNormalization_output_0           
[I]             Minimum Required Tolerance: elemwise error | [abs=5.7399] OR [rel=1.0935] (requirements may be lower if both abs/rel tolerances are set)
[I]             Absolute Difference | Stats: mean=0.81905, std-dev=0.56622, var=0.3206, median=0.74607, min=3.4004e-07 at (0, 1, 150973), max=5.7399 at (0, 11, 47871), avg-magnitude=0.81905
[I]                 ---- Histogram ----
                    Bin Range        |  Num Elems | Visualization
                    (3.4e-07, 0.574) |    6758688 | ########################################
                    (0.574  , 1.15 ) |    5077625 | ##############################
                    (1.15   , 1.72 ) |    3851361 | ######################
                    (1.72   , 2.3  ) |     936867 | #####
                    (2.3    , 2.87 ) |     134072 |
                    (2.87   , 3.44 ) |      16417 |
                    (3.44   , 4.02 ) |       1931 |
                    (4.02   , 4.59 ) |        222 |
                    (4.59   , 5.17 ) |         29 |
                    (5.17   , 5.74 ) |          4 |
[I]             Relative Difference | Stats: mean=0.99609, std-dev=0.003915, var=1.5327e-05, median=0.99998, min=0.84108 at (1, 7, 46214), max=1.0935 at (1, 7, 112736), avg-magnitude=0.99609
[I]                 ---- Histogram ----
                    Bin Range      |  Num Elems | Visualization
                    (0.841, 0.866) |          2 |
                    (0.866, 0.892) |          2 |
                    (0.892, 0.917) |          2 |
                    (0.917, 0.942) |          5 |
                    (0.942, 0.967) |         12 |
                    (0.967, 0.993) |    8386932 | #######################################
                    (0.993, 1.02 ) |    8390241 | ########################################
                    (1.02 , 1.04 ) |         13 |
                    (1.04 , 1.07 ) |          3 |
                    (1.07 , 1.09 ) |          4 |
[E]         FAILED | Output: '/input_blocks.1/input_blocks.1.0/in_layers.0/InstanceNormalization_output_0' | Difference exceeds tolerance (rel=1e-05, abs=1e-05)

Is there a bug in TensorRT's InstanceNormalization layer? My InstanceNorm layer is actually a 32-group GroupNorm.

This is a common case: InstanceNorm causes accuracy issues in various models. The key problem is that, when using FP16 precision, the InstanceNorm kernel needs to accumulate the per-channel/batch elements in FP16. Although the final result of InstanceNorm does not overflow, the accumulator can overflow before the division operation is executed. That causes the accuracy issue.
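
As a toy illustration of that failure mode (plain numpy, not TensorRT's actual kernel; 262144 is the per-channel element count from the log above):

import numpy as np

# One instance-norm channel above has 262144 elements. Even with small values,
# a running FP16 sum passes the FP16 maximum (65504) long before the division.
x = np.full(262144, 0.5, dtype=np.float16)
print(np.sum(x, dtype=np.float16))   # inf: the FP16 accumulator overflows
print(np.sum(x, dtype=np.float32))   # 131072.0: fine with an FP32 accumulator
print(np.mean(x, dtype=np.float32))  # 0.5: the normalization statistic itself is small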

The recommended way to fix this issue is to mark the precision of the InstanceNorm layers as FP32. Polygraphy provides this functionality with "--layerPrecision"/"--tensorPrecision"; check the Polygraphy --help output to see how to use it. trtexec should also have similar options.

Thank you for your explanation! However, I have a very large model. Is there any example of using Python to build the TRT engine from the ONNX model? Otherwise, I would have to list all the InstanceNorm layer names on the trtexec command line… (Or do the trtexec command-line flags support regular expressions?)

You can use ONNX GraphSurgeon to modify your ONNX model; it's easy to use. See TensorRT/tools/onnx-graphsurgeon/examples/04_modifying_a_model at main · NVIDIA/TensorRT · GitHub

As for your second question, as far as I know, trtexec does not support regular expressions; we might add that in the future. For now, you can write a script that finds all InstanceNorm layer names and generates the trtexec command.
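
For reference, here is a rough sketch of doing the same thing directly from Python instead of generating a trtexec command (assuming the TensorRT 8.x builder API; the ONNX parser usually keeps the original node names in the layer names, but verify this for your model):

import tensorrt as trt

TRT_LOGGER = trt.Logger()
builder = trt.Builder(TRT_LOGGER)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, TRT_LOGGER)
with open('onnx/sample.onnx', 'rb') as f:
    assert parser.parse(f.read()), parser.get_error(0)

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)
config.set_flag(trt.BuilderFlag.OBEY_PRECISION_CONSTRAINTS)

# Pin every layer whose name mentions InstanceNormalization to FP32.
for i in range(network.num_layers):
    layer = network.get_layer(i)
    if 'InstanceNormalization' in layer.name:
        layer.precision = trt.float32
        for j in range(layer.num_outputs):
            layer.set_output_type(j, trt.float32)

with open('onnx/sample.trt', 'wb') as f:
    f.write(builder.build_serialized_network(network, config))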

I tried the layer-precision flag in Polygraphy. However, I still get a large error when the norm is forced to FP32. Is there anything wrong with my command?

polygraphy run onnx/sample.onnx --trt --fp16 --precision-constraints=obey --layer-precisions=/input_blocks.1/input_blocks.1.0/in_layers.0/InstanceNormalization:float32 --onnxrt --trt-outputs mark all --onnx-outputs mark all
[I]     Comparing Output: '/input_blocks.1/input_blocks.1.0/in_layers.0/InstanceNormalization_output_0' (dtype=float32, shape=(2, 32, 262144)) with '/input_blocks.1/input_blocks.1.0/in_layers.0/InstanceNormalization_output_0' (dtype=float16, shape=(2, 32, 262144))
[I]         Tolerance: [abs=1e-05, rel=1e-05] | Checking elemwise error                                          
[I]         trt-runner-N0-02/22/23-08:38:32: /input_blocks.1/input_blocks.1.0/in_layers.0/InstanceNormalization_output_0 | Stats: mean=-1.8423e-09, std-dev=0.005532, var=3.0603e-05, median=0, min=-0.042147 at (1, 13, 123060), max=0.045272 at (0, 11, 47871), avg-magnitude=0.0031986
[I]             ---- Histogram ----                                                                              
                Bin Range        |  Num Elems | Visualization                                                    
                (-5.38 , -4.27 ) |          0 |                                                                  
                (-4.27 , -3.15 ) |          0 |                                                                  
                (-3.15 , -2.03 ) |          0 |                                                                  
                (-2.03 , -0.916) |          0 |                                                                  
                (-0.916, 0.201 ) |   16777216 | ########################################                         
                (0.201 , 1.32  ) |          0 |                                                                  
                (1.32  , 2.43  ) |          0 |                                                                  
                (2.43  , 3.55  ) |          0 |                                                                  
                (3.55  , 4.67  ) |          0 |                                                                  
                (4.67  , 5.79  ) |          0 |                                                                  
[I]         onnxrt-runner-N0-02/22/23-08:38:32: /input_blocks.1/input_blocks.1.0/in_layers.0/InstanceNormalization_output_0 | Stats: mean=2.0576e-07, std-dev=0.99961, var=0.99923, median=0.053864, min=-5.3828 at (1, 13, 123060), max=5.7852 at (0, 11, 47871), avg-magnitude=0.82224
[I]             ---- Histogram ----                                                                              
                Bin Range        |  Num Elems | Visualization                                                    
                (-5.38 , -4.27 ) |         62 |                                                                  
                (-4.27 , -3.15 ) |       4321 |                                                                  
                (-3.15 , -2.03 ) |     246204 | #                                                                
                (-2.03 , -0.916) |    3273482 | ######################                                           
                (-0.916, 0.201 ) |    5860698 | ########################################                         
                (0.201 , 1.32  ) |    5725682 | #######################################                          
                (1.32  , 2.44  ) |    1631536 | ###########                                                      
                (2.44  , 3.55  ) |      34636 | 
                (3.55  , 4.67  ) |        578 |
                (4.67  , 5.79  ) |         17 |
[I]         Error Metrics: /input_blocks.1/input_blocks.1.0/in_layers.0/InstanceNormalization_output_0
[I]             Minimum Required Tolerance: elemwise error | [abs=5.7399] OR [rel=1.0935] (requirements may be lower if both abs/rel tolerances are set)
[I]             Absolute Difference | Stats: mean=0.81905, std-dev=0.56622, var=0.3206, median=0.74607, min=3.4004e-07 at (0, 1, 150973), max=5.7399 at (0, 11, 47871), avg-magnitude=0.81905
[I]                 ---- Histogram ----
                    Bin Range        |  Num Elems | Visualization
                    (3.4e-07, 0.574) |    6758688 | ########################################
                    (0.574  , 1.15 ) |    5077625 | ##############################
                    (1.15   , 1.72 ) |    3851361 | ######################
                    (1.72   , 2.3  ) |     936867 | #####
                    (2.3    , 2.87 ) |     134072 |
                    (2.87   , 3.44 ) |      16417 |
                    (3.44   , 4.02 ) |       1931 |
                    (4.02   , 4.59 ) |        222 |
                    (4.59   , 5.17 ) |         29 |
                    (5.17   , 5.74 ) |          4 |
[I]             Relative Difference | Stats: mean=0.99609, std-dev=0.003915, var=1.5327e-05, median=1, min=0.84108 at (1, 7, 46214), max=1.0935 at (1, 7, 112736), avg-magnitude=0.99609
[I]                 ---- Histogram ----
                    Bin Range      |  Num Elems | Visualization
                    (0.841, 0.866) |          2 |
                    (0.866, 0.892) |          2 |
                    (0.892, 0.917) |          2 |
                    (0.917, 0.942) |          5 |
                    (0.942, 0.967) |         12 |
                    (0.967, 0.993) |    8386932 | #######################################
                    (0.993, 1.02 ) |    8390241 | ########################################
                    (1.02 , 1.04 ) |         13 |
                    (1.04 , 1.07 ) |          3 |
                    (1.07 , 1.09 ) |          4 |

Can you add the "-v -v -v" options to Polygraphy and share the whole Polygraphy log with me?

Here is the log: https://cloud.tsinghua.edu.cn/f/85eee484ca024503bd31/?dl=1

It seems that this time the error has become even larger…

Yes, this seems like a TRT bug. I can repro it in TRT 8.5, but the issue appears to have been fixed in TRT 8.6. You can wait for the TRT 8.6 release to verify it; it should come out next month.

Alright… Looking forward to the new release!


Moreover, is there any workaround before the new release comes out? I notice there are group normalization plugins in the repo, but they don't have any docs or READMEs. Can I use them, and if so, how?