I am using ONNX Runtime built with the TensorRT backend to run inference on an ONNX model. When running the model, I got the following warning: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
The cast-down then occurs, but the problem is that it takes a significant amount of time. The first inference is also very slow:
- It takes ~35x longer to load the network with TRT compared to not using it.
- It takes ~40x longer to run the first inference with TRT compared to not using it.
- From there on, inference is ~20-25% faster with TRT compared to not using it.
I believe the top two bullet points are related and have to do with the INT64-to-INT32 cast-down.
Is there something I can do to mitigate this?
Environment
TensorRT Version: 7
CUDA Version: 10.2
CUDNN Version: 7.6
Operating System + Version: Windows 10 64-bit
I went through the process of serializing the model and saving it to avoid this overhead, and then performed inference in C++.
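For reference, the build-and-serialize step looked roughly like the following. This is only a minimal sketch using the TensorRT 7 Python API rather than the C++ API I actually used, and the file names model.onnx / model.trt are placeholders. Deserializing the saved engine later skips the ONNX parse (where the INT64-to-INT32 cast happens) and the expensive engine build.

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

# Parse the ONNX model into a TensorRT network (explicit batch is required by the ONNX parser)
builder = trt.Builder(TRT_LOGGER)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, TRT_LOGGER)
with open("model.onnx", "rb") as f:  # placeholder path
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise SystemExit("ONNX parse failed")

# Build the engine once, then serialize it to disk so later runs can just deserialize it
config = builder.create_builder_config()
config.max_workspace_size = 1 << 30  # 1 GB; TRT 7-era API
engine = builder.build_engine(network, config)
with open("model.trt", "wb") as f:
    f.write(engine.serialize())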
However, upon further inspection of the inference outputs, it seems that the output I get using TensorRT is quite different from the original output I had in TensorFlow, and even from the output I got through ONNX Runtime. I have a regression model, and it seems that the TensorRT output distribution has been "squeezed in" compared to the TensorFlow output.
I seem to get the same wrong output whether I use the TensorRT C++ API or ONNX Runtime built with TensorRT.
Could this be due to the casting from INT64 to INT32, or could it be due to another issue? Are there other build options I can try to see if I can fix this?
Without looking into the model and code, it's difficult to pinpoint the reason for the output mismatch.
Can you try to run the ONNX model on its own and compare the output values?
You can also try TF-TRT to build the optimized model; please refer to the TF-TRT user guide for more details.
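One quick way to do that comparison is to run the same input through ONNX Runtime with the CPU provider and with the TensorRT execution provider and look at the difference. A minimal sketch, assuming the model file is model.onnx and using a placeholder input shape and dtype:

import numpy as np
import onnxruntime as ort

# Placeholder input; substitute a real sample with your model's input shape and dtype
x = np.random.rand(1, 3, 224, 224).astype(np.float32)

def run(providers):
    sess = ort.InferenceSession("model.onnx", providers=providers)
    input_name = sess.get_inputs()[0].name
    return sess.run(None, {input_name: x})[0]

ref = run(["CPUExecutionProvider"])
trt = run(["TensorrtExecutionProvider", "CUDAExecutionProvider"])
print("max abs diff:", np.abs(ref - trt).max())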
@SunilJB
I also see the exact same warning message when running inference on an ONNX model (exported from a PyTorch-based classification model with an HRNet-type architecture, opset_version 11). The TRT inference classification output also does not match the output of the source PyTorch model.
[08/28/2020-15:10:27] [W] [TRT] onnx2trt_utils.cpp:220: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
What is the probable impact of this warning message? Has this issue been addressed already?
Environment
TensorRT Version: TensorRT-7.1.3.4
CUDA Version: CUDA 11.0
CUDNN Version: cudnn-v8.0.2.39
Operating System + Version: Windows 10 64-bit
TensorRT 7 downcasts INT64 to INT32 automatically. But if you have limited the available memory with config->setMaxWorkspaceSize(1 << X) to less than what the model actually requires to load, it will take too much time to load, and the script might even get killed.
You can simply omit the setMaxWorkspaceSize call, if you have used one.
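If you are going through ONNX Runtime's TensorRT execution provider rather than the raw TensorRT API, recent onnxruntime releases expose the same workspace knob, plus engine caching (which also helps with the slow load and first inference discussed above), as provider options. A sketch, with model.onnx and the cache path as placeholders; option names and availability depend on your onnxruntime version:

import onnxruntime as ort

# TensorRT EP provider options (names as documented for recent onnxruntime releases)
trt_options = {
    "trt_max_workspace_size": 2147483648,  # 2 GB; raise this if the engine build needs more
    "trt_engine_cache_enable": True,       # cache built engines so later sessions skip the rebuild
    "trt_engine_cache_path": "./trt_cache",
}

sess = ort.InferenceSession(
    "model.onnx",
    providers=[("TensorrtExecutionProvider", trt_options), "CUDAExecutionProvider"],
)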
Hi,
Request you to share the ONNX model and the script if not shared already so that we can assist you better.
Alongside, you can try a few things:
1) Validate your model with the below snippet:
check_model.py
import onnx

filename = "your_model.onnx"  # placeholder: path to your ONNX model
model = onnx.load(filename)
onnx.checker.check_model(model)  # raises an exception if the model is invalid
2) Try running your model with the trtexec command: https://github.com/NVIDIA/TensorRT/tree/master/samples/opensource/trtexec
In case you are still facing the issue, request you to share the trtexec --verbose log for further debugging.
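For example, assuming the model file is model.onnx:

trtexec --onnx=model.onnx --verbose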
Thanks!
The above check_model script does not return anything; I tested the ONNX model and it is working fine. I have attached the log file with verbose logging enabled as suggested. Getting the below output:
[W] [TRT] onnx2trt_utils.cpp:220: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[W] [TRT] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
terminate called after throwing an instance of 'pwgen::PwgenException'
what(): Driver error:
Aborted (core dumped)
I have also observed a similar performance degradation when using the TensorRT execution provider, but this error does not happen with the CUDA execution provider. Overall, though, GPU performance when running ONNX Runtime seems slower than CPU on a Jetson Xavier. Any insights on this? Why is that? Is there a solution to scale up the performance?
I have a PC with a 3060 and I am encountering the same error and the same behaviour, where the CPUExecutionProvider is faster than CUDA and even TensorRT when using the rembg library.
I initially installed the GPU version of rembg thinking it would be faster, but it is indeed slower. Are there any fixes for this? Please help.