I am using ONNX Runtime built with the TensorRT backend to run inference on an ONNX model. When running the model, I get the following warning: "Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32."
The cast-down does succeed, but it takes a significant amount of time. I also notice that the first inference is very slow:
- It takes ~35X longer to load the network with TRT than without it
- The first inference takes ~40X longer with TRT than without it
- From then on, inference is ~20-25% faster with TRT than without it
I believe the top two bullet points are related and are caused by the INT64-to-INT32 cast-down.
Is there something I can do to mitigate this?
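For reference, the timings above can be reproduced with a small harness along these lines. This is a sketch: the `model.onnx` path, the input name, and the input array in the commented-out ONNX Runtime usage are placeholders for my actual model.

```python
import time

def time_load_and_infer(make_session, run_once, n_steady=10):
    """Time three phases: session creation, first inference,
    and the average of n_steady subsequent inferences."""
    t0 = time.perf_counter()
    sess = make_session()
    load_s = time.perf_counter() - t0

    t0 = time.perf_counter()
    run_once(sess)  # first inference (includes any lazy engine build)
    first_s = time.perf_counter() - t0

    t0 = time.perf_counter()
    for _ in range(n_steady):
        run_once(sess)
    steady_s = (time.perf_counter() - t0) / n_steady
    return load_s, first_s, steady_s

# With ONNX Runtime this would be invoked roughly as (placeholders,
# not run here):
#   import onnxruntime as ort
#   make_session = lambda: ort.InferenceSession(
#       "model.onnx",
#       providers=["TensorrtExecutionProvider", "CUDAExecutionProvider"])
#   run_once = lambda s: s.run(None, {"input": input_array})
```

Comparing the three returned numbers with and without `TensorrtExecutionProvider` in the provider list is how I arrived at the ~35X / ~40X / ~20-25% figures.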
TensorRT Version: 7
CUDA Version: 10.2
CUDNN Version: 7.6
Operating System + Version: Windows 10 64-bit