In INT8 mode, the outputs of ONNX Runtime and TensorRT are inconsistent

Description

I have an INT8 ONNX model that contains Quantize and Dequantize nodes. I deploy it with both ONNX Runtime and TensorRT, but I get a confusing difference between their outputs.
The model produces two outputs during inference; I marked these output nodes as 434 and 448.
If I mark only one of them, for example only 434 as my output node while ignoring 448, there is barely any gap between the ONNX Runtime output and the TensorRT output. The same holds with 448 as the output node and 434 ignored.
However, if I try to get them both, only the values of 434 are correct, while the values of 448 differ completely between the two models.
Even more confusing: if I also mark node 446, the output of the convolution that feeds 448, so that 434/446/448 are all outputs, then all three nodes give correct results.
I don't know why this happens.
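For reference, the way I expose the intermediate tensors as extra graph outputs on the ONNX side is roughly like the sketch below (file names are placeholders, not the actual files in debug.zip, and the float dtype is an assumption):

```python
import onnx

# Sketch: expose intermediate tensors (e.g. "434" and "448") as additional graph
# outputs so that both ONNX Runtime and TensorRT return them for comparison.
# "model.onnx" is a placeholder path, not the file shipped in debug.zip.
model = onnx.load("model.onnx")

for name in ["434", "448"]:
    # dtype assumed float (tensors after DequantizeLinear/Conv); shape left unspecified.
    value_info = onnx.helper.make_tensor_value_info(name, onnx.TensorProto.FLOAT, None)
    model.graph.output.append(value_info)

onnx.save(model, "model_multi_output.onnx")
```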

Environment

TensorRT Version: 8.4.1.5
GPU Type: V100
Nvidia Driver Version: 450.142.00
CUDA Version: 11.0
CUDNN Version:
Operating System + Version:
Python Version (if applicable): 3.8.13
PyTorch Version (if applicable): 1.10.2
onnxruntime-gpu: 1.11.1

Relevant Files

debug.zip (18.0 MB)
This debug.zip contains the scripts, dummy input, and ONNX model needed to reproduce the phenomenon described above.

Steps To Reproduce

python compare_onnx_trt_subgraph.py

The output should look like:

outputs nodes: ['434']
434  l1 loss between onnx and trt:  1.0309417120879516e-05

outputs nodes: ['448']
448  l1 loss between onnx and trt:  9.65373601502506e-06

outputs nodes: ['434', '448']
434  l1 loss between onnx and trt:  1.0309417120879516e-05
448  l1 loss between onnx and trt:  0.14752599596977234

outputs nodes: ['434', '448', '446']
434  l1 loss between onnx and trt:  1.0309417120879516e-05
446  l1 loss between onnx and trt:  1.5866717149037868e-05
448  l1 loss between onnx and trt:  9.65373601502506e-06
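For context, the comparison done by compare_onnx_trt_subgraph.py is essentially the following. The sketch below uses NVIDIA's Polygraphy instead of my raw TensorRT code, so the API calls and file names are assumptions rather than the exact script in debug.zip:

```python
from polygraphy.backend.onnxrt import OnnxrtRunner, SessionFromOnnx
from polygraphy.backend.trt import CreateConfig, EngineFromNetwork, NetworkFromOnnxPath, TrtRunner
from polygraphy.comparator import Comparator

MODEL = "model_multi_output.onnx"  # placeholder for the multi-output QDQ model

# Build a TensorRT engine with INT8 enabled; the QDQ nodes make precision explicit.
build_engine = EngineFromNetwork(NetworkFromOnnxPath(MODEL), config=CreateConfig(int8=True))

runners = [
    OnnxrtRunner(SessionFromOnnx(MODEL)),
    TrtRunner(build_engine),
]

# Feed the same (synthetic) input to both runners and compare every marked output.
results = Comparator.run(runners)
Comparator.compare_accuracy(results)
```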

Hi, please refer to the links below to perform inference in INT8.

Thanks!

I believe the code I use to convert the ONNX model to TensorRT and to run inference has no bugs. Perhaps there is a bug in TensorRT's operator fusion for Quantize/Dequantize nodes: some optimization may cause the memory used by the 448-related nodes to be accidentally overwritten, so the correct results cannot be obtained. I would greatly appreciate it if you could read the code again and explain why this error occurs.

But if I mark node 446 as an output, the entire right branch may be excluded from that optimization, which would explain why the result is as expected in that case.
By the way, node 446 is the output of the conv in the right branch.
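To check this hypothesis, one thing that could help is dumping TensorRT's per-layer information for each output configuration and comparing which layers around 446/448 get fused. A sketch, assuming the engines were built with detailed profiling verbosity and using placeholder engine file names:

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.INFO)

def dump_layer_info(engine_path: str) -> None:
    # Requires the engine to have been built with
    # config.profiling_verbosity = trt.ProfilingVerbosity.DETAILED
    with open(engine_path, "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
        engine = runtime.deserialize_cuda_engine(f.read())
        inspector = engine.create_engine_inspector()
        # JSON listing of every layer after fusion; search for 446/448 here.
        print(inspector.get_engine_information(trt.LayerInformationFormat.JSON))

# Placeholder engine names for the two output configurations.
dump_layer_info("outputs_434_448.engine")
dump_layer_info("outputs_434_446_448.engine")
```

If marking 446 as an output really does block a QDQ fusion in the right branch, the two dumps should show different fused layers feeding 448.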