A weird bug: two similar ONNX models, but one TensorRT engine gives wrong output


I have two simple ONNX files, o1.onnx and o3.onnx. o1.onnx is a subgraph of o3.onnx; the only difference is that o3.onnx has two additional outputs.

When I convert o1.onnx to a TensorRT engine, everything works fine. However, when I convert o3.onnx to a TensorRT engine, the engine's output has a large error.


Environment: official Docker container 22.12

Relevant Files

Related files: https://cloud.tsinghua.edu.cn/f/09c8c8a1d6a44fa0915a/?dl=1

Steps To Reproduce

import os
from polygraphy.backend.onnxrt import OnnxrtRunner
from polygraphy.backend.trt import TrtRunner
import numpy as np

feed_dict = {'input_0': np.load('bug.npy')}

BASE = 'o3'

import onnxruntime as ort
sess = ort.InferenceSession('{}.onnx'.format(BASE), providers=['CUDAExecutionProvider'])
with OnnxrtRunner(sess) as runner:
    outputs_ort = runner.infer(feed_dict)

import tensorrt as trt

TRT_LOGGER = trt.Logger()
trt.init_libnvinfer_plugins(TRT_LOGGER, '')

def load_engine(engine_file_path):
    assert os.path.exists(engine_file_path)
    print("Reading engine from file {}".format(engine_file_path))
    with open(engine_file_path, "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
        return runtime.deserialize_cuda_engine(f.read())

os.system('trtexec --onnx={}.onnx --saveEngine={}.trt --fp16 --buildOnly'.format(BASE, BASE))
engine = load_engine('{}.trt'.format(BASE))
with TrtRunner(engine) as runner:
    outputs_trt = runner.infer(feed_dict)

print('max error', np.abs(outputs_ort['output_0']-outputs_trt['output_0']).max())

When BASE='o1', the max error is only 9e-6, whereas when BASE='o3', the max error is 30+.

Moreover, the error is only produced by my input npy file (which is the real input for my model). If I use the polygraphy run command, the output is normal.
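Since o3.onnx has three outputs and the script above compares only output_0, it may help to compare every output the two runners have in common. The following is a minimal sketch, not part of the original script; compare_outputs is a hypothetical helper name, and it assumes both runners return a dict mapping output names to NumPy arrays of matching shape.

```python
import numpy as np

def compare_outputs(ref, test):
    """Return {name: (max_abs_err, max_rel_err)} for outputs present in both dicts."""
    report = {}
    for name in sorted(set(ref) & set(test)):
        a = np.asarray(ref[name], dtype=np.float64)
        b = np.asarray(test[name], dtype=np.float64)
        abs_err = np.abs(a - b)
        # Relative error is scaled by the reference magnitude;
        # the small epsilon avoids division by zero.
        rel_err = abs_err / (np.abs(a) + 1e-12)
        report[name] = (float(abs_err.max()), float(rel_err.max()))
    return report

if __name__ == "__main__":
    # Stand-in data; in the script above these would be outputs_ort and outputs_trt.
    ref = {"output_0": np.array([1.0, 2.0, 4.0])}
    test = {"output_0": np.array([1.0, 2.5, 4.0])}
    for name, (abs_err, rel_err) in compare_outputs(ref, test).items():
        print(name, "max abs error", abs_err, "max rel error", rel_err)
```

Calling compare_outputs(outputs_ort, outputs_trt) at the end of the script would show whether the two extra outputs of o3 are also wrong or only output_0.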

Please share the ONNX model and the script, if you have not already, so that we can assist you better.
In the meantime, you can try a few things:

  1. Validate your model with the snippet below:

import onnx

filename = "yourONNXmodel"  # placeholder: replace with the path to your ONNX model
model = onnx.load(filename)
onnx.checker.check_model(model)  # raises an exception if the model is invalid

  2. Try running your model with the trtexec command.

If you are still facing the issue, please share the trtexec --verbose log for further debugging.

The related files are at this link: https://cloud.tsinghua.edu.cn/f/09c8c8a1d6a44fa0915a/?dl=1

Hi, I still face this issue. It's weird that downstream ONNX nodes can affect upstream ones. Is this a bug in TensorRT?
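One thing that might be worth ruling out, given that the engine is built with --fp16 and the error shows up only with the real input, is whether the input contains values that float16 cannot represent. The sketch below is only an assumption about a possible cause, not a confirmed diagnosis; fp16_input_report is a hypothetical helper name, and it inspects only the input tensor, so intermediate activations could still overflow even when the input is in range.

```python
import numpy as np

# Largest finite value representable in float16 (65504.0).
FP16_MAX = float(np.finfo(np.float16).max)

def fp16_input_report(x):
    """Summarize how well an input array fits into the float16 range."""
    x = np.asarray(x, dtype=np.float32)
    return {
        "abs_max": float(np.abs(x).max()),
        "num_non_finite": int((~np.isfinite(x)).sum()),
        "num_above_fp16_max": int((np.abs(x) > FP16_MAX).sum()),
    }

if __name__ == "__main__":
    # Stand-in data; for the real check this would be np.load('bug.npy').
    sample = np.array([0.5, -70000.0, 1000.0], dtype=np.float32)
    print(fp16_input_report(sample))
```

If the report looks clean, rebuilding the engine without --fp16 and rerunning the comparison would be another way to see whether reduced precision is involved at all.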