Is there any way to make ReduceMean in TRT produce exactly the same output as ONNX Runtime?


I am investigating a precision problem in a GPT-2 model, using Polygraphy to debug. I am not using low precision, and I set --strict-types. I set --onnx-outputs mark all to compare the result of every layer. The absolute difference first becomes nonzero at a ReduceMean layer and grows afterwards, finally exceeding the threshold.
I guess the difference comes from a different order of addition, but is there any way to make ReduceMean produce exactly the same output? I expect the output of the network to be exactly the same.
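
The order-of-addition effect mentioned above can be reproduced on a tiny scale. The following is a minimal sketch in plain Python (double precision, not float32 on a GPU); `mean_sequential` and `mean_pairwise` are hypothetical helpers written for illustration and do not reflect how TensorRT or ONNX Runtime actually implement ReduceMean.

```python
def mean_sequential(values):
    """Mean via left-to-right accumulation."""
    total = 0.0
    for v in values:
        total += v
    return total / len(values)


def mean_pairwise(values):
    """Mean via a tree-shaped (pairwise) reduction."""
    def tree_sum(xs):
        if len(xs) == 1:
            return xs[0]
        mid = len(xs) // 2
        return tree_sum(xs[:mid]) + tree_sum(xs[mid:])
    return tree_sum(values) / len(values)


values = [0.1, 0.2, 0.3]
a = mean_sequential(values)  # ((0.1 + 0.2) + 0.3) / 3
b = mean_pairwise(values)    # (0.1 + (0.2 + 0.3)) / 3
print(a == b, abs(a - b))
```

Both functions compute the same mathematical mean, yet the results differ in the last bits because floating-point addition is not associative. Two backends that reduce in different orders can therefore disagree slightly even in full precision.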


TensorRT Version:
GPU Type: T4
Nvidia Driver Version:460.73.01
CUDA Version: 10.2
CUDNN Version: 8.1.0
Operating System + Version:RedHat
Python Version (if applicable): 3.8
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag):

Relevant Files

Here is part of my Polygraphy log output:
[I] Comparing Output: '231' (dtype=float32, shape=(1, 128, 1)) with '231' (dtype=float32, shape=(1, 128, 1)) | Tolerance: [abs=1e-05, rel=1e-05] | Checking elemwise error

[I] Absolute Difference | Stats: mean=1.2617e-10, std-dev=1.1599e-10, var=1.3454e-20, median=1.1642e-10, min=0 at (0, 1, 0), max=4.6566e-10 at (0, 42, 0)

231 is the output of a ReduceMean layer.
Later, in a MatMul layer, the accumulated difference exceeds 1e-05:
[I] Absolute Difference | Stats: mean=8.9681e-06, std-dev=1.4915e-05, var=2.2244e-10, median=3.8147e-06, min=0 at (0, 0, 0, 0), max=0.00048828 at (0, 5, 116, 115)
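
For scale, the maximum differences in the two log lines above (4.6566e-10 and 0.00048828) are very close to 2^-31 and 2^-11. The sketch below computes the gap between adjacent float32 values near a given magnitude. `float32_spacing` is a hypothetical helper, and the sample magnitudes 0.005 and 5000.0 are illustrative assumptions; the actual tensor values are not shown in the log.

```python
import math


def float32_spacing(x):
    """Gap between adjacent float32 values near x (normal range only)."""
    if x == 0.0:
        raise ValueError("spacing near zero needs subnormal handling, omitted here")
    # frexp gives abs(x) = m * 2**exponent with m in [0.5, 1),
    # so the leading bit of x has weight 2**(exponent - 1).
    _, exponent = math.frexp(abs(x))
    # float32 stores 23 fraction bits below the leading bit.
    return math.ldexp(1.0, exponent - 1 - 23)


# Assumed magnitudes, chosen only to illustrate the scale of the spacing.
print(float32_spacing(0.005))   # equals 2**-31
print(float32_spacing(5000.0))  # equals 2**-11
```

If the compared values have magnitudes in those ranges, the reported maxima would correspond to a difference of a single float32 step. That would be consistent with rounding differences rather than a logic error, although the log alone cannot confirm it.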

Steps To Reproduce

My command:
polygraphy run to_onnx/gpt.onnx --model-type onnx --onnxrt --trt -v --input-shapes input:[1,128] seg:[1,128] mask:[1,1,128,128] --int-min 0 --int-max 20000 --float-min -10000 --float-max 0 --val-range input:[0,20000] seg:[1, 2] mask:[-10000.0, 0] --onnx-outputs mark all --trt-outputs mark all --log-file compare.log --strict-types

Hi @1055057679

In my opinion, such an exact match should not be expected against ONNX Runtime, or between any two implementations of a DL model, whether on CPU, GPU, or a mix.
TensorRT provides no way to achieve this.
DL networks are typically robust against changes in the order of FP operations.

But please do let me know if this is impacting the accuracy in your case.


Hi, thanks for replying.
Indeed, the order of FP operations shouldn’t affect the precision so much.
It now seems that some MatMul ops after the ReduceMean introduce a noticeably larger precision loss. I am looking into that problem with the help of your colleague. I will update this post with the result when we find the cause.

The problem is finally solved. My PyTorch code had a bug: I used torch.multinomial, which outputs an int64 tensor, concatenated it with an int32 tensor, and used the result as input for the next loop iteration.
I think it is important for users who are debugging to know that it is common to see a FAILED message when comparing TRT with ONNX Runtime using Polygraphy. Many layers cannot match ONNX Runtime exactly even at full precision, for example ReduceMean and Erf. Softmax can only be computed exactly the same using cuDNN, and MatMul only using cuBLAS, not cuBLASLt. These ops produce differences between TRT and ONNX Runtime, but in most cases the differences are acceptable.
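
Since exact equality is not achievable for these ops, outputs are usually compared against a tolerance instead. The following is a minimal sketch of such a check using the standard-library `math.isclose`; `outputs_match` is a hypothetical helper and is not Polygraphy's actual comparison logic. The default tolerances simply mirror the `abs=1e-05, rel=1e-05` values shown in the log.

```python
import math


def outputs_match(expected, actual, atol=1e-5, rtol=1e-5):
    """True if every element pair is within the absolute or relative tolerance."""
    if len(expected) != len(actual):
        return False
    return all(
        math.isclose(e, a, rel_tol=rtol, abs_tol=atol)
        for e, a in zip(expected, actual)
    )


# Illustrative values only, not taken from the model.
print(outputs_match([1.0], [1.0 + 1e-7]))           # tiny absolute difference
print(outputs_match([1000.0], [1000.0 + 4.88e-4]))  # small relative to magnitude
print(outputs_match([1.0], [1.0 + 4.88e-4]))        # too large for both tolerances
```

The same absolute difference can pass or fail depending on the magnitude of the values being compared. When a FAILED result comes only from rounding-level differences, loosening the tolerances, for example with Polygraphy's `--atol` and `--rtol` options, is typically more appropriate than trying to force bit-exact results.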