I created a model on PyTorch and exported it as ONNX. When I run this model with ONNX-RT and TRT using random inputs ranged between [0-2] results match with atol=rtol=1e-5. However when I feed real-life data into this model, which has a greater dynamic ranged input data, these tolerances are too tight for a match and I need to bump them up to atol=rtol=1e-3 for a pass from polygraphy test. I tried setting --tactic-sources to cudnn, cublas, cublas_lt seperately and that changed nothing. I was wondering if this behaviour normal, are atol=rtol=1e-3 tolerances feasible ?
Environment
TensorRT Version: 7.2.1.1 GPU Type: RTX3090 Nvidia Driver Version: 460.91.03 CUDA Version: 11.1 CUDNN Version: 8.2.0 Operating System + Version: Ubuntu 18 PyTorch Version (if applicable): 1.8.2
Looks like you’re using an old version of the TensorRT. We recommend you please try on the latest TensorRT version 8.4 EA. Please let us know if you still face this issue. https://developer.nvidia.com/nvidia-tensorrt-8x-download
I did try with both 8.4.0 EA and 8.2.3. Nothing changed.
Ran layerwise comparison with; polygraphy run centerpoint_exports/adastec_centerpoint_head.onnx --trt --onnxrt --trt-outputs mark all --onnx-outputs mark all --workspace 1200e06 --load-inputs /home/adastec/OpenPCDet_Exports/utils/custom_inputs.json --fail-fast
It fails from the output of first node which is a ‘Conv’
This is expected behavior, the absolute and relative error tolerance (limits) specified as default in Polygraphy (1e-5) generally works for most CNNs but not all. Depending on the model and input dynamic range, this may need to be higher (1e-3 is not out of the ordinary). It is ultimately the developer’s call whether this variation is acceptable (for each output). If not, we need to investigate per-layer outputs to identify where the differences are stemming from.
Thanks a lot. I deployed the model using both torchscript and tensorrt. Torchscript was able to match Pytorch code’s outputs with 1e-05=atol=rtol. Both torchscript and onnx->TRT seem to work fine in terms of bounding boxes they predict. But I did not evaluate with metrics like mAP yet.