【TensorRT】buildEngineWithConfig too slow in FP16

Description

I want to build an engine (mBuilder->buildEngineWithConfig(*mNetwork, *mConfig)) from an ONNX model of the YOLOv4 network backbone. It builds and runs successfully.
However, buildEngineWithConfig takes far too long (49 minutes) when I set the FP16 flag (mConfig->setFlag(nvinfer1::BuilderFlag::kFP16)).
Is there a way to speed this up?

Environment

TensorRT Version: 7.2.2.3
GPU Type: 3060
Nvidia Driver Version: 471.41
CUDA Version: 11.1
CUDNN Version: 8.0.4.30
Operating System + Version: Windows 10 21H1
Python Version (if applicable): None
TensorFlow Version (if applicable): None
PyTorch Version (if applicable): None
Baremetal or Container (if container which image + tag): None

Relevant Files

if (mRunMode == 1)
{
    spdlog::info("setFp16Mode");
    if (!mBuilder->platformHasFastFp16()) {
        spdlog::warn("the platform does not have fast FP16 support");
    }
    mBuilder->setFp16Mode(true);  // deprecated since TensorRT 7, kept alongside the builder flag below
    mConfig->setFlag(nvinfer1::BuilderFlag::kFP16);
}
mBuilder->setMaxBatchSize(mBatchSize);
// set the maximum GPU temporary memory which the engine can use at execution time (10 << 20 = 10 MiB)
mConfig->setMaxWorkspaceSize(10 << 20);
spdlog::info("fp16 support: {}", mBuilder->platformHasFastFp16());
spdlog::info("int8 support: {}", mBuilder->platformHasFastInt8());
spdlog::info("Max batchsize: {}", mBuilder->getMaxBatchSize());
spdlog::info("Max workspace size: {}", mConfig->getMaxWorkspaceSize());
spdlog::info("Number of DLA cores: {}", mBuilder->getNbDLACores());
spdlog::info("Max DLA batchsize: {}", mBuilder->getMaxDLABatchSize());
spdlog::info("Current DLA core in use: {}", mConfig->getDLACore()); // TODO: set DLA core
spdlog::info("build engine...");
mEngine = mBuilder->buildEngineWithConfig(*mNetwork, *mConfig);

Hi,
Request you to share the model, script, profiler, and performance output if not shared already, so that we can help you better.
Alternatively, you can try running your model with the trtexec command.
https://github.com/NVIDIA/TensorRT/tree/master/samples/opensource/trtexec
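
For example, an invocation along the following lines should reproduce the FP16 build and print verbose timing logs (the file names and the workspace size here are placeholders to adapt to your setup):

trtexec --onnx=yolov4_backbone.onnx --fp16 --workspace=2048 --verbose --saveEngine=yolov4_backbone_fp16.engine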

While measuring the model's performance, make sure you consider the latency and throughput of the network inference itself, excluding the data pre- and post-processing overhead.
Please refer to the links below for more details:
https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-722/best-practices/index.html#measure-performance
https://docs.nvidia.com/deeplearning/tensorrt/best-practices/index.html#model-accuracy
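
For reference, a minimal sketch of timing only the inference call with CUDA events; timeInferenceMs, the warm-up count, and the bindings layout are illustrative assumptions, not code from any sample:

#include <NvInfer.h>
#include <cuda_runtime_api.h>

// Assumes a valid IExecutionContext*, a cudaStream_t, and an array of
// pre-allocated device buffers; input/output copies and any pre/post
// processing are done outside the timed region.
float timeInferenceMs(nvinfer1::IExecutionContext* context,
                      void** bindings, cudaStream_t stream, int iterations)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up runs so one-time initialization is not counted.
    for (int i = 0; i < 10; ++i)
        context->enqueueV2(bindings, stream, nullptr);
    cudaStreamSynchronize(stream);

    cudaEventRecord(start, stream);
    for (int i = 0; i < iterations; ++i)
        context->enqueueV2(bindings, stream, nullptr);  // network inference only
    cudaEventRecord(stop, stream);
    cudaEventSynchronize(stop);

    float totalMs = 0.f;
    cudaEventElapsedTime(&totalMs, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return totalMs / iterations;  // average latency per inference in ms
}

Throughput can then be derived as batch size divided by the averaged latency.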

Thanks!

Thank you for your reply.
My ONNX model and .pth model are 251 MB in size, so I can't upload them here.
Is there another way to share my ONNX and .pth models? Google Drive or Baidu Netdisk?

Hi,

Could you please try the latest TensorRT version, 8.2 EA?
If you still face this issue, we recommend you share the trtexec --verbose logs, an issue repro script, and the ONNX model via Google Drive so that we can try it from our end.

Thank you for your reply.
I have tried the latest TensorRT version, 8.2 EA. It still takes too much time (42 minutes) to build the engine from the ONNX model.
My test code below is modified from sampleOnnxMNIST.cpp in sample_onnx_mnist.sln.
test_my_onnx.cpp (12.0 KB)
class_timer.hpp (645 Bytes)
My log:
engine log.txt (3.2 KB)
My Google Drive link for the ONNX model: