【TensorRT】buildEngineWithConfig too slow in FP16

Description

I want to build an engine (mBuilder->buildEngineWithConfig(*mNetwork, *mConfig)) from an ONNX model of the YOLOv4 network backbone. It works and runs successfully.
However, buildEngineWithConfig takes far too long (49 minutes) when I set the FP16 flag (mConfig->setFlag(nvinfer1::BuilderFlag::kFP16)).
Is there a way to speed this up?

Environment

TensorRT Version: 7.2.2.3
GPU Type: 3060
Nvidia Driver Version: 471.41
CUDA Version: 11.1
CUDNN Version: 8.0.4.30
Operating System + Version: Windows 10 21H1
Python Version (if applicable): None
TensorFlow Version (if applicable): None
PyTorch Version (if applicable): None
Baremetal or Container (if container which image + tag): None

Relevant Files

if (mRunMode == 1)
{
    spdlog::info("setFp16Mode");
    if (!mBuilder->platformHasFastFp16()) {
        spdlog::warn("the platform does not have fast FP16 support");
    }
    mBuilder->setFp16Mode(true); // deprecated in TensorRT 7, kept for compatibility
    mConfig->setFlag(nvinfer1::BuilderFlag::kFP16);
}
mBuilder->setMaxBatchSize(mBatchSize);
// set the maximum GPU temporary memory the engine can use at execution time (10 << 20 = 10 MiB)
mConfig->setMaxWorkspaceSize(10 << 20);
spdlog::info("fp16 support: {}", mBuilder->platformHasFastFp16());
spdlog::info("int8 support: {}", mBuilder->platformHasFastInt8());
spdlog::info("Max batchsize: {}", mBuilder->getMaxBatchSize());
spdlog::info("Max workspace size: {}", mConfig->getMaxWorkspaceSize());
spdlog::info("Number of DLA cores: {}", mBuilder->getNbDLACores());
spdlog::info("Max DLA batchsize: {}", mBuilder->getMaxDLABatchSize());
spdlog::info("Current DLA core: {}", mConfig->getDLACore()); // TODO: set DLA core
spdlog::info("build engine...");
mEngine = mBuilder->buildEngineWithConfig(*mNetwork, *mConfig);

Hi,
Please share the model, script, profiler and performance output if you have not already, so that we can help you better.
Alternatively, you can try running your model with the trtexec command.
https://github.com/NVIDIA/TensorRT/tree/master/samples/opensource/trtexec
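For example, something along these lines (a sketch; model.onnx is a placeholder for your model path, and --workspace is given in MB):

trtexec --onnx=model.onnx --fp16 --workspace=4096 --verbose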

While measuring model performance, make sure you consider only the latency and throughput of the network inference, excluding the data pre- and post-processing overhead (a minimal timing sketch follows the links below).
Please refer to the links below for more details:
https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-722/best-practices/index.html#measure-performance
https://docs.nvidia.com/deeplearning/tensorrt/best-practices/index.html#model-accuracy
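For reference, a minimal sketch of timing only the inference call with CUDA events (context, bindings, and stream are placeholders for your own setup):

// Time only IExecutionContext::enqueueV2, excluding pre-/post-processing
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, stream);
context->enqueueV2(bindings, stream, nullptr); // bindings: device pointers for all inputs/outputs
cudaEventRecord(stop, stream);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop); // elapsed GPU time in milliseconds
spdlog::info("inference latency: {} ms", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);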

Thanks!

Thank you for your reply.
My ONNX model and .pth model are 251 MB, and I can't upload them here.
Is there another way to share them, such as Google Cloud Drive or Baidu Netdisk?

Hi,

Could you please try the latest TensorRT version, 8.2 EA?
If you still face this issue, please share the trtexec --verbose logs, an issue repro script, and the ONNX model via Google Drive so we can try it on our end.

Thank you for your reply.
I have tried the latest TensorRT version, 8.2 EA. It still takes too long (42 minutes) to build the engine from the ONNX model.
My test code below is modified from sampleOnnxMNIST.cpp in sample_onnx_mnist.sln.
test_my_onnx.cpp (12.0 KB)
class_timer.hpp (645 Bytes)
My log:
engine log.txt (3.2 KB)
My Google Drive link for the ONNX model:

Hi,

We could not reproduce this issue; we were able to build the engine successfully in less than 5 minutes. Have you tried increasing the workspace?
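For example (a sketch based on the mConfig from your snippet; note that 10 << 20 in the posted code is only 10 MiB):

// Allow the builder a larger scratch workspace, e.g. 4 GiB
mConfig->setMaxWorkspaceSize(4ULL << 30);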

I had already tried increasing the workspace to 10 GB and it didn't help. Isn't that enough?

I just tried increasing the workspace to 40 GB and it still took 49 minutes to build the engine.

Hi,

Please allow us some time to get back to you on this.

Thank you.

Hi,

Currently, we don't have a really good solution, but you can try the TacticSources feature and disable cuDNN, cuBLAS, and cuBLASLt. That should speed up engine building.
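A sketch of what that could look like with the IBuilderConfig tactic-source API (kCUBLAS and kCUBLAS_LT exist in TensorRT 7.2; kCUDNN is available in TensorRT 8.x):

// Drop cuDNN, cuBLAS and cuBLASLt from the tactic sources the builder is allowed to time
uint32_t tactics = mConfig->getTacticSources();
tactics &= ~(1U << static_cast<uint32_t>(nvinfer1::TacticSource::kCUDNN));
tactics &= ~(1U << static_cast<uint32_t>(nvinfer1::TacticSource::kCUBLAS));
tactics &= ~(1U << static_cast<uint32_t>(nvinfer1::TacticSource::kCUBLAS_LT));
mConfig->setTacticSources(tactics);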

We can also speed up the build by setting the precision of each layer to FP16 and enabling strict precision constraints (kOBEY_PRECISION_CONSTRAINTS); this disables the FP32 fallback layers, but the build will fail if a layer has no FP16 implementation.
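Roughly like this (a sketch assuming the TensorRT 8.2 flag name kOBEY_PRECISION_CONSTRAINTS; older releases expose kSTRICT_TYPES instead):

// Pin every layer to FP16 and forbid the builder from falling back to FP32
mConfig->setFlag(nvinfer1::BuilderFlag::kFP16);
mConfig->setFlag(nvinfer1::BuilderFlag::kOBEY_PRECISION_CONSTRAINTS);
for (int i = 0; i < mNetwork->getNbLayers(); ++i)
{
    mNetwork->getLayer(i)->setPrecision(nvinfer1::DataType::kHALF); // build fails if no FP16 kernel exists for this layer
}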
Another option is the global timing cache: the build will only be slow the first time per TensorRT release, and each subsequent build will be faster.
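A sketch of the timing-cache flow (TensorRT 8.x API; loadCacheFile and saveCacheFile are hypothetical helpers that read/write a binary blob):

// Seed the builder with a previously serialized timing cache, if one exists
std::vector<char> blob = loadCacheFile("timing.cache"); // hypothetical helper; empty on the first run
nvinfer1::ITimingCache* timingCache = mConfig->createTimingCache(blob.data(), blob.size());
mConfig->setTimingCache(*timingCache, /*ignoreMismatch=*/false);

mEngine = mBuilder->buildEngineWithConfig(*mNetwork, *mConfig);

// Persist the now-populated cache so later builds skip most kernel timing
nvinfer1::IHostMemory* serializedCache = timingCache->serialize();
saveCacheFile("timing.cache", serializedCache->data(), serializedCache->size()); // hypothetical helper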

A future TensorRT release will also improve build speeds; we suggest moving to the new version when it is released.

Thank you.

I have the same problem, and the situation is hard to analyze.
Building the engine with the TensorRT API takes 5 to 10 minutes on an RTX 3060, but over 30 minutes on an RTX 3080.

I tried to find a difference in hardware, such as the CPU model, but couldn't find one.

This solved my problem.