【TensorRT】buildEngineWithConfig too slow in FP16

Description

I want to build an engine (mBuilder->buildEngineWithConfig(*mNetwork, *mConfig)) from an ONNX model of the YOLOv4 network backbone. It works and runs successfully.
However, buildEngineWithConfig takes far too long (49 minutes) when I set the FP16 flag (mConfig->setFlag(nvinfer1::BuilderFlag::kFP16)).
Is there a way to speed this up?

Environment

TensorRT Version: 7.2.2.3
GPU Type: 3060
Nvidia Driver Version: 471.41
CUDA Version: 11.1
CUDNN Version: 8.0.4.30
Operating System + Version: Windows 10 21H1
Python Version (if applicable): None
TensorFlow Version (if applicable): None
PyTorch Version (if applicable): None
Baremetal or Container (if container which image + tag): None

Relevant Files

if (mRunMode == 1)
{
    spdlog::info("setFp16Mode");
    if (!mBuilder->platformHasFastFp16()) {
        spdlog::warn("the platform does not have fast FP16 support");
    }
    mBuilder->setFp16Mode(true); // deprecated in TensorRT 7, kept for compatibility
    mConfig->setFlag(nvinfer1::BuilderFlag::kFP16);
}
mBuilder->setMaxBatchSize(mBatchSize);
// set the maximum GPU temporary memory the engine can use at execution time (10 << 20 = 10 MiB)
mConfig->setMaxWorkspaceSize(10 << 20);
spdlog::info("fp16 support: {}", mBuilder->platformHasFastFp16());
spdlog::info("int8 support: {}", mBuilder->platformHasFastInt8());
spdlog::info("Max batchsize: {}", mBuilder->getMaxBatchSize());
spdlog::info("Max workspace size: {}", mConfig->getMaxWorkspaceSize());
spdlog::info("Number of DLA cores: {}", mBuilder->getNbDLACores());
spdlog::info("Max DLA batchsize: {}", mBuilder->getMaxDLABatchSize());
spdlog::info("Current DLA core: {}", mConfig->getDLACore()); // TODO: set DLA core
spdlog::info("build engine...");
mEngine = mBuilder->buildEngineWithConfig(*mNetwork, *mConfig);

Hi,
Please share the model, script, profiler and performance output if you have not already, so that we can help you better.
Alternatively, you can try running your model with the trtexec command.
https://github.com/NVIDIA/TensorRT/tree/master/samples/opensource/trtexec
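For example, something along these lines (a sketch; model.onnx is a placeholder for your model path, and --workspace is given in MB):

trtexec --onnx=model.onnx --fp16 --workspace=4096 --verbose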

While measuring model performance, make sure you consider only the latency and throughput of the network inference, excluding the data pre- and post-processing overhead (a minimal timing sketch follows the links below).
Please refer to the links below for more details:
https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-722/best-practices/index.html#measure-performance
https://docs.nvidia.com/deeplearning/tensorrt/best-practices/index.html#model-accuracy
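For reference, a minimal sketch of timing only the inference call with CUDA events (context, bindings, and stream are placeholders for your own setup):

// Time only IExecutionContext::enqueueV2, excluding pre-/post-processing
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, stream);
context->enqueueV2(bindings, stream, nullptr); // bindings: device pointers for all inputs/outputs
cudaEventRecord(stop, stream);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop); // elapsed GPU time in milliseconds
spdlog::info("inference latency: {} ms", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);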

Thanks!

Thank you for your reply.
My ONNX model and .pth model are 251 MB, and I can't upload them here.
Is there another way to share them, such as Google Cloud Drive or Baidu Netdisk?

Hi,

Could you please try the latest TensorRT version, 8.2 EA?
If you still face this issue, please share the trtexec --verbose logs, an issue repro script, and the ONNX model via Google Drive so we can try it on our end.

Thank you for your reply.
I have tried the latest TensorRT version, 8.2 EA. It still takes too long (42 minutes) to build the engine from the ONNX model.
My test code below is modified from sampleOnnxMNIST.cpp in sample_onnx_mnist.sln.
test_my_onnx.cpp (12.0 KB)
class_timer.hpp (645 Bytes)
My log:
engine log.txt (3.2 KB)
My Google Drive link for the ONNX model:

Hi,

We could not reproduce this issue; we were able to build the engine successfully in less than 5 minutes. Have you tried increasing the workspace?
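For example (a sketch based on the mConfig from your snippet; note that 10 << 20 in the posted code is only 10 MiB):

// Allow the builder a larger scratch workspace, e.g. 4 GiB
mConfig->setMaxWorkspaceSize(4ULL << 30);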

I had already tried increasing the workspace to 10 GB and it didn't help. Isn't that enough?

I just tried increasing the workspace to 40 GB and it still took 49 minutes to build the engine.

Hi,

Please allow us some time to get back to you on this.

Thank you.

Hi,

Currently, we don't have a really good solution, but you can try the TacticSources feature and disable cuDNN, cuBLAS, and cuBLASLt. That should speed up engine building.
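A sketch of what that could look like with the IBuilderConfig tactic-source API (kCUBLAS and kCUBLAS_LT exist in TensorRT 7.2; kCUDNN is available in TensorRT 8.x):

// Drop cuDNN, cuBLAS and cuBLASLt from the tactic sources the builder is allowed to time
uint32_t tactics = mConfig->getTacticSources();
tactics &= ~(1U << static_cast<uint32_t>(nvinfer1::TacticSource::kCUDNN));
tactics &= ~(1U << static_cast<uint32_t>(nvinfer1::TacticSource::kCUBLAS));
tactics &= ~(1U << static_cast<uint32_t>(nvinfer1::TacticSource::kCUBLAS_LT));
mConfig->setTacticSources(tactics);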

We can also speed up the build by setting the precision of each layer to FP16 and enabling strict precision constraints (kOBEY_PRECISION_CONSTRAINTS); this disables the FP32 fallback layers, but the build will fail if a layer has no FP16 implementation.
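Roughly like this (a sketch assuming the TensorRT 8.2 flag name kOBEY_PRECISION_CONSTRAINTS; older releases expose kSTRICT_TYPES instead):

// Pin every layer to FP16 and forbid the builder from falling back to FP32
mConfig->setFlag(nvinfer1::BuilderFlag::kFP16);
mConfig->setFlag(nvinfer1::BuilderFlag::kOBEY_PRECISION_CONSTRAINTS);
for (int i = 0; i < mNetwork->getNbLayers(); ++i)
{
    mNetwork->getLayer(i)->setPrecision(nvinfer1::DataType::kHALF); // build fails if no FP16 kernel exists for this layer
}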
Another option is the global timing cache: the build will only be slow the first time per TensorRT release, and each subsequent build will be faster.
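A sketch of the timing-cache flow (TensorRT 8.x API; loadCacheFile and saveCacheFile are hypothetical helpers that read/write a binary blob):

// Seed the builder with a previously serialized timing cache, if one exists
std::vector<char> blob = loadCacheFile("timing.cache"); // hypothetical helper; empty on the first run
nvinfer1::ITimingCache* timingCache = mConfig->createTimingCache(blob.data(), blob.size());
mConfig->setTimingCache(*timingCache, /*ignoreMismatch=*/false);

mEngine = mBuilder->buildEngineWithConfig(*mNetwork, *mConfig);

// Persist the now-populated cache so later builds skip most kernel timing
nvinfer1::IHostMemory* serializedCache = timingCache->serialize();
saveCacheFile("timing.cache", serializedCache->data(), serializedCache->size()); // hypothetical helper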

A future TensorRT release will also improve build speeds; we suggest moving to the new version when it is released.

Thank you.

I have the same problem, and the situation is hard to analyze.
Building the engine with the TensorRT API takes 5 to 10 minutes on an RTX 3060, but over 30 minutes on an RTX 3080.

I tried to find a difference in hardware, such as the CPU model, but couldn't find one.

This solved my problem.