ONNX vs TRT Output Mismatch Tolerances

Description

I created a model in PyTorch and exported it to ONNX. When I run this model with ONNX Runtime and TRT using random inputs in the range [0, 2], the results match with atol=rtol=1e-5. However, when I feed real-life data into the model, which has a much larger dynamic range, these tolerances are too tight for a match and I need to bump them up to atol=rtol=1e-3 for the Polygraphy test to pass. I tried setting --tactic-sources to cudnn, cublas, and cublas_lt separately, and that changed nothing. I was wondering: is this behaviour normal, and are atol=rtol=1e-3 tolerances feasible?
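
For anyone reproducing this, inputs in the format --load-inputs expects can be saved roughly like this (a minimal sketch assuming Polygraphy's save_json format; the tensor name and the .npy path are placeholders, not the real ones):

# Serialize one real-life sample as the List[Dict[str, np.ndarray]] structure
# that `polygraphy run --load-inputs` consumes.
import numpy as np
from polygraphy.json import save_json

real_sample = np.load("real_sample_features.npy").astype(np.float32)  # real-life data, larger dynamic range
feed = [{"input_features": real_sample}]  # keys must match the ONNX model's input names
save_json(feed, "custom_inputs.json", description="custom inputs")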

Environment

TensorRT Version: 7.2.1.1
GPU Type: RTX3090
Nvidia Driver Version: 460.91.03
CUDA Version: 11.1
CUDNN Version: 8.2.0
Operating System + Version: Ubuntu 18
PyTorch Version (if applicable): 1.8.2

Relevant Files

ONNX file and custom input data: onnx_debug - Google Drive

Steps To Reproduce

Failing run:

polygraphy run ./adastec_centerpoint_head.onnx --trt --onnxrt --workspace 1200e06 --load-inputs '/home/adastec/OpenPCDet_Exports/utils/custom_inputs.json' --atol 1e-5 --rtol 1e-5

Passing run:

polygraphy run ./adastec_centerpoint_head.onnx --trt --onnxrt --workspace 1200e06 --load-inputs '/home/adastec/OpenPCDet_Exports/utils/custom_inputs.json' --atol 1e-3 --rtol 1e-3

Hi,

It looks like you’re using an old version of TensorRT. We recommend trying the latest TensorRT version, 8.4 EA. Please let us know if you still face this issue.
https://developer.nvidia.com/nvidia-tensorrt-8x-download

Thank you.

I’ll try. But you are acknowledging that I should be achieving a closer match, right?

I did try with both 8.4.0 EA and 8.2.3. Nothing changed.

Ran a layer-wise comparison with:
polygraphy run centerpoint_exports/adastec_centerpoint_head.onnx --trt --onnxrt --trt-outputs mark all --onnx-outputs mark all --workspace 1200e06 --load-inputs /home/adastec/OpenPCDet_Exports/utils/custom_inputs.json --fail-fast

It fails starting from the output of the first node, which is a ‘Conv’.
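
In case it helps narrow things down, one option (a sketch I have not run; the tensor names are placeholders you would read off Netron or polygraphy inspect model) is to cut that Conv out with onnx.utils.extract_model and compare the sub-model on its own:

# Extract a sub-model containing only the first Conv so it can be compared in
# isolation, e.g. with `polygraphy run first_conv_only.onnx --trt --onnxrt`.
import onnx.utils

onnx.utils.extract_model(
    "adastec_centerpoint_head.onnx",  # original model
    "first_conv_only.onnx",           # sub-model to write out
    input_names=["input_features"],   # placeholder: tensor feeding the first Conv
    output_names=["conv1_output"],    # placeholder: the first Conv's output tensor
)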

Hi,

Our team is looking into this. Please allow us some time to work on this.

Thank you.

Hi,

This is expected behavior. The default absolute and relative error tolerances in Polygraphy (1e-5) generally work for most CNNs, but not all. Depending on the model and the dynamic range of the inputs, the tolerances may need to be higher (1e-3 is not out of the ordinary). It is ultimately the developer’s call whether this variation is acceptable (for each output). If not, we need to investigate per-layer outputs to identify where the differences stem from.
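
To make the criterion concrete: an element passes when its absolute difference is within atol + rtol * |reference| (this roughly mirrors np.isclose semantics rather than Polygraphy's exact internals), and since floating-point error typically scales with the magnitude of the values involved, outputs with a large dynamic range can exceed a 1e-5 budget while still agreeing to several significant digits. A made-up illustration:

# Hypothetical output values: the same comparison fails at 1e-5 but passes at 1e-3.
import numpy as np

golden = np.array([0.02, 3.0, 450.0], dtype=np.float32)       # reference (ONNX-Runtime) outputs
trt    = np.array([0.0201, 3.0004, 450.2], dtype=np.float32)  # hypothetical TensorRT outputs

for atol, rtol in [(1e-5, 1e-5), (1e-3, 1e-3)]:
    ok = np.abs(trt - golden) <= atol + rtol * np.abs(golden)
    print(f"atol=rtol={atol}: pass per element = {ok}")

If only some outputs need the looser bound, per-output tolerances (e.g. --atol <output_name>:1e-3, if your Polygraphy version supports the per-output syntax) let you keep the remaining outputs strict.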

Please check item 2 in the following link.

Thank you.

Thanks a lot. I deployed the model using both TorchScript and TensorRT. TorchScript was able to match the PyTorch code’s outputs with atol=rtol=1e-5. Both TorchScript and ONNX->TRT seem to work fine in terms of the bounding boxes they predict, but I have not evaluated with metrics like mAP yet.