Our application uses TensorRT to build and deploy a deep learning model for a specific task. The model must be compiled on the hardware that will run it. However, the application is distributed to customers (with any hardware spec), so the model is compiled/built during installation. Currently, it takes several minutes (specifically 1.5-3 minutes) to build the model in FP32 format using the ONNX parser (this has also been tested with the Caffe parser).
The main concern is deploying multiple deep learning models on a PC, which could take a long time. Moreover, building in FP16 precision takes up to 40 minutes for a single deep learning model. This has been tested on several GPUs, including the RTX 4000, A5000 and P4000, and all of them show the same issue.
Is there a way to solve this problem?
Is it possible to disable optimisation during the deployment stage or to speed up the building step of a deep learning model?
What is the optimal solution to this problem?
GPU RTX 4000
NVIDIA Driver 472.39
Please share the model, script, profiler and performance output, if not shared already, so that we can help you better.
Alternatively, you can try running your model with the trtexec command.
While measuring the model performance, make sure you consider the latency and throughput of the network inference, excluding the data pre- and post-processing overhead.
Please refer to the link below for more details:
This is not related to model inference but rather to the build time of the TRT engine.
For example, if we consider the YOLOv5 model using TensorRT, buildEngineWithConfig takes a long time to compile a TRT model. It doesn’t matter whether the ONNX or Caffe parser is used; the build time will be similar.
Honestly, the topic of long build times has been discussed many times, for example in this post or here.
Therefore, this question is about finding a solution, or at least some options, to avoid (or reduce) model build time. As I previously stated, some applications may require multiple deep learning solutions. Consider an application that employs 10 completely different deep learning solutions powered by TensorRT. The build time may vary from 1 to 5 minutes per model depending on the architecture. If customers choose to install the application with those models, it may take a very long time to build all of them on their machine. Customers might avoid using the application because they see this as an obstacle.
So what would be the optimal solution to this issue?
Currently, I don’t see a solution or any other options; rather, I expect the build time to increase with the new TensorRT version 8.0.1, where it’s stated:
Engine build times for TensorRT 8.0 may be slower than TensorRT 7.2 due to the engine optimizer being more aggressive.
Currently, we don’t have a really good solution yet, but you can try using the TacticSources feature to disable cuDNN, cuBLAS, and cuBLASLt. That should speed up the network build.
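As a rough illustration of the TacticSources suggestion, here is a minimal Python sketch. It assumes the TensorRT Python bindings (`import tensorrt as trt`) and a builder config created with `builder.create_builder_config()`. The helper names `tactic_mask` and `disable_library_tactics` are made up for this example; the `trt` module is passed in as a parameter only so the helpers can be exercised without a GPU.

```python
def tactic_mask(sources):
    """Combine tactic sources into the bitmask used by set_tactic_sources."""
    mask = 0
    for source in sources:
        mask |= 1 << int(source)
    return mask


def disable_library_tactics(config, trt_module):
    """Clear the cuDNN, cuBLAS and cuBLASLt bits on a builder config.

    `config` is expected to behave like trt.IBuilderConfig and
    `trt_module` like the imported `tensorrt` module (assumption).
    """
    disabled = tactic_mask([
        trt_module.TacticSource.CUDNN,
        trt_module.TacticSource.CUBLAS,
        trt_module.TacticSource.CUBLAS_LT,
    ])
    # Keep whatever other sources were enabled, drop only the three above.
    remaining = config.get_tactic_sources() & ~disabled
    config.set_tactic_sources(remaining)
    return remaining
```

With a real builder this would be called as `disable_library_tactics(config, trt)` before building the engine. Restricting the tactic sources may also change which kernels end up in the engine, so it is worth re-checking inference speed afterwards.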
You can also speed up the build by setting the precision of each layer to FP16 and selecting kOBEY_PRECISION. This will disable FP32 layers, but the build will fail if a layer has no FP16 implementation.
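The per-layer FP16 idea could look roughly like the Python sketch below. It assumes a TensorRT release where the Python builder flag is exposed as `trt.BuilderFlag.OBEY_PRECISION_CONSTRAINTS` (to my knowledge 8.2 and later; older releases used `STRICT_TYPES` instead). The helper name `force_fp16` is hypothetical, and the `trt` module is passed in as a parameter so the function can be tried without a GPU.

```python
def force_fp16(network, config, trt_module):
    """Request FP16 for every layer and make the builder obey it.

    `network` is expected to behave like trt.INetworkDefinition,
    `config` like trt.IBuilderConfig, and `trt_module` like the
    imported `tensorrt` module (assumptions for this sketch).
    Returns the number of layers that were marked.
    """
    # Allow FP16 kernels, then require the builder to respect
    # the per-layer precision set below.
    config.set_flag(trt_module.BuilderFlag.FP16)
    config.set_flag(trt_module.BuilderFlag.OBEY_PRECISION_CONSTRAINTS)

    for index in range(network.num_layers):
        layer = network.get_layer(index)
        layer.precision = trt_module.float16
    return network.num_layers
```

As noted above, the build fails if some layer has no FP16 implementation, so in practice such layers would probably need to be skipped in the loop.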
Another option is to use the global timing cache, so the build is slow only the first time on a per-release basis and faster for each subsequent build.
Also, a future TRT release will include improvements in build speed; we suggest moving to the new version when it is released.
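A minimal Python sketch of persisting the timing cache between builds is shown below. It assumes the timing cache API available from TensorRT 8.0 (`create_timing_cache`, `set_timing_cache`, `get_timing_cache`, `ITimingCache.serialize`). The helper names and the cache file path are hypothetical.

```python
import os


def load_timing_cache(config, cache_path):
    """Attach a timing cache to the config, seeded from disk if present.

    `config` is expected to behave like trt.IBuilderConfig (assumption).
    An empty buffer makes TensorRT start with a fresh cache.
    """
    data = b""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            data = f.read()
    cache = config.create_timing_cache(data)
    # Second argument is ignore_mismatch: False rejects a cache that was
    # produced for a different device or TensorRT version.
    config.set_timing_cache(cache, False)
    return cache


def save_timing_cache(config, cache_path):
    """Write the (possibly updated) timing cache back to disk after a build."""
    cache = config.get_timing_cache()
    with open(cache_path, "wb") as f:
        f.write(memoryview(cache.serialize()))
```

The intended flow is `load_timing_cache(config, path)`, then build the engine, then `save_timing_cache(config, path)`. As far as I know the cache is tied to the GPU model and TensorRT version, so it would be created on the customer's machine during the first build rather than shipped with the installer.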