TensorRT model build time and deployment

ND_satnik · December 17, 2021, 10:47am

Hello,

Our application uses TensorRT to build and deploy a deep learning model for a specific task. The model must be compiled on the hardware that will run it. However, the application is distributed to customers (with any hardware spec), so the model is compiled/built during installation. Currently, it takes several minutes (specifically 1.5-3 minutes) to build the model in FP32 using the ONNX parser (this has also been tested with the Caffe parser).

The main concern is deploying multiple deep learning models on a PC, which could take a long time. Moreover, in FP16 precision a single deep learning model takes up to 40 minutes to compile. This has been tested on several GPUs, including the RTX 4000, A5000 and P4000, and all of them show the same issue.

Is there a way to solve this problem?
Is it possible to disable optimisation during the deployment stage or to speed up the building step of a deep learning model?
What is the optimal solution to this problem?

Spec:
TensorRT 7.2.3.1
GPU RTX 4000
NVIDIA Driver 472.39
CUDA 11.1
CUDNN 8.2
Windows 10

Best regards,
Andrej

NVES · December 17, 2021, 11:08am

Hi,
Please share the model, script, profiler and performance output, if not shared already, so that we can help you better.
Alternatively, you can try running your model with the trtexec command.
https://github.com/NVIDIA/TensorRT/tree/master/samples/opensource/trtexec

While measuring the model performance, make sure you consider the latency and throughput of the network inference, excluding the data pre- and post-processing overhead.
Please refer to the links below for more details:
https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-722/best-practices/index.html#measure-performance
https://docs.nvidia.com/deeplearning/tensorrt/best-practices/index.html#model-accuracy

Thanks!

ND_satnik · February 1, 2022, 6:05pm

Hello again,
This is not related to model inference but rather to the build time of the TRT engine.

For example, if we consider a YOLOv5 model using TensorRT, buildEngineWithConfig takes a long time to build a TRT engine. It doesn’t matter whether the ONNX or Caffe parser is used; the build time is similar.
Honestly, the topic of long build times has been discussed many times, for example in this post or here.

Therefore, this question is aimed at finding a solution, or at least some options, to avoid (or reduce) the model build time. As I previously stated, some applications may require multiple deep learning solutions. Consider an application that employs 10 completely different deep learning solutions powered by TensorRT. The build time may vary from 1 to 5 minutes per model, depending on the architecture. If customers choose to install the application with those models, it may take a very long time to build all of them on their machine. Customers might avoid using it because they see this as an obstacle.

So what would be the optimal solution to this issue?

Currently, I don’t see a solution or any other options; rather, I expect the build time to increase with the new TensorRT version 8.0.1, where it is stated:

Engine build times for TensorRT 8.0 may be slower than TensorRT 7.2 due to the engine optimizer being more aggressive.

Best regards,
Andrej

spolisetty · February 2, 2022, 5:44am

Hi,

Currently, we don’t have a really good solution yet, but we can try using the TacticSources feature and disabling cudnn, cublas, and cublasLt. That should speed up the network building.
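
As a rough illustration of the TacticSources suggestion, here is a minimal sketch with the Python API. It assumes `config` is an IBuilderConfig you have already created, `trt` is the imported tensorrt module, and that your TensorRT version exposes `TacticSource.CUDNN` (to my knowledge 8.0 or newer). The helper names `clear_bits` and `disable_library_tactics` are made up for this example.

```python
def clear_bits(mask, bit_positions):
    """Return `mask` with the given bit positions cleared."""
    for pos in bit_positions:
        mask &= ~(1 << pos)
    return mask


def disable_library_tactics(config, trt):
    """Turn off the cuDNN, cuBLAS and cuBLASLt tactic sources on a builder config.

    `trt` is the imported tensorrt module and `config` an IBuilderConfig.
    Both are passed in, so this helper has no import-time dependency.
    """
    sources = [
        trt.TacticSource.CUDNN,
        trt.TacticSource.CUBLAS,
        trt.TacticSource.CUBLAS_LT,
    ]
    # The tactic sources are a bitmask: bit N enables the source whose enum value is N.
    mask = clear_bits(config.get_tactic_sources(), [int(s) for s in sources])
    config.set_tactic_sources(mask)
    return mask
```

Call `disable_library_tactics(config, trt)` before building the engine. Whether this helps, and whether every layer still finds an implementation, depends on the network, so it is worth timing the build with and without it.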

Also, we can speed up the build by setting the precision of each layer to FP16 and selecting kOBEY_PRECISION_CONSTRAINTS. This disables the FP32 layers, but the build will fail if a layer has no FP16 implementation.
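
A sketch of what the per-layer FP16 request could look like in the Python API, under a few assumptions: `network` is an already parsed INetworkDefinition, `config` is an IBuilderConfig, `trt` is the imported tensorrt module, and the TensorRT version has `BuilderFlag.OBEY_PRECISION_CONSTRAINTS` (8.2 or newer, as far as I know; older releases used `STRICT_TYPES` instead). The function name `request_fp16` is hypothetical, and skipping layers with non-float outputs is my own precaution, not something stated above.

```python
def request_fp16(network, config, trt):
    """Ask the builder to run every float layer in FP16 and to honour that request.

    Returns the number of layers whose precision was changed.
    """
    config.set_flag(trt.BuilderFlag.FP16)
    config.set_flag(trt.BuilderFlag.OBEY_PRECISION_CONSTRAINTS)

    changed = 0
    for i in range(network.num_layers):
        layer = network.get_layer(i)
        outputs = [layer.get_output(j) for j in range(layer.num_outputs)]
        # Leave layers that produce non-float data (e.g. INT32 indices) untouched.
        if any(t.dtype != trt.float32 for t in outputs):
            continue
        layer.precision = trt.float16
        for j in range(layer.num_outputs):
            layer.set_output_type(j, trt.float16)
        changed += 1
    return changed
```

Note that this also changes the data type of the network outputs to FP16, so the code that reads the results may need adjusting.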
Another thing we can do is use the global timing cache, so the build will only be slow the first time on a per-release basis, and will be faster for each subsequent build.

Also, a future TRT release has improvements in build speed; we suggest moving to the new version when it is released.
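
For the timing cache, a minimal sketch of loading and saving it around a build could look like the following. It assumes `config` is an IBuilderConfig from a TensorRT version that has the timing cache API (to my knowledge 8.0 or newer, so not 7.2.3.1). The helper names and the cache file path are placeholders.

```python
import os


def load_timing_cache(config, path):
    """Attach a timing cache to `config`, seeded from `path` if that file exists.

    Returns the cache object so the caller can keep a reference to it.
    """
    data = b""
    if os.path.exists(path):
        with open(path, "rb") as f:
            data = f.read()
    # An empty buffer creates a fresh, empty cache.
    cache = config.create_timing_cache(data)
    # ignore_mismatch=False: do not accept a cache recorded on a different device.
    if not config.set_timing_cache(cache, False):
        # Fall back to an empty cache if the stored one was rejected.
        cache = config.create_timing_cache(b"")
        config.set_timing_cache(cache, False)
    return cache


def save_timing_cache(config, path):
    """Write the (possibly updated) timing cache back to disk after a build."""
    blob = config.get_timing_cache().serialize()
    with open(path, "wb") as f:
        f.write(memoryview(blob))
```

The idea is to call `load_timing_cache(config, path)` before building and `save_timing_cache(config, path)` afterwards, so that later builds on the same machine can reuse the recorded timings.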

Thank you.
