TensorRT caching mechanism is not very fast; deserializeCudaEngine takes a long time

Hello,

currently I’m testing TensorRT 2.1 on the TX2, but I’m not that happy with the loading times.

I’ve taken [url]https://github.com/dusty-nv/jetson-inference/blob/master/tensorNet.cpp[/url] as an example and based my code on it.
There the developer creates a cache file which is loaded if it exists; if it doesn’t, the engine is built and profiled and the cache is written out.
My guess is this is done to save program startup time after the first run.
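
For context, here is a minimal sketch of that build-then-cache pattern as I understand it (my own condensed code, not the original tensorNet.cpp; the helper names are hypothetical, and it uses the newer serialize()/deserializeCudaEngine() API, which differs slightly from TensorRT 2.1):

#include <NvInfer.h>

#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Build once, serialize the engine to a cache file, and on later runs
// deserialize it instead of rebuilding.
void saveEngineCache(nvinfer1::ICudaEngine* engine, const std::string& cachePath)
{
    nvinfer1::IHostMemory* blob = engine->serialize();   // flatten the engine into a byte blob
    std::ofstream cache(cachePath, std::ios::binary);
    cache.write(static_cast<const char*>(blob->data()), blob->size());
    blob->destroy();
}

nvinfer1::ICudaEngine* loadEngineCache(nvinfer1::IRuntime* runtime, const std::string& cachePath)
{
    std::ifstream cache(cachePath, std::ios::binary);
    if (!cache)
        return nullptr;   // no cache yet -> caller falls back to buildCudaEngine()

    std::vector<char> blob((std::istreambuf_iterator<char>(cache)),
                            std::istreambuf_iterator<char>());

    // This is the call that is unexpectedly slow on the second run.
    return runtime->deserializeCudaEngine(blob.data(), blob.size(), nullptr);
}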

I’ve measured some of the time-consuming functions, and here are the results for my own net.

The call builder->buildCudaEngine(*network) takes 24.7 seconds.
After that the call infer->deserializeCudaEngine takes 0.003 seconds.

Now if I rerun the program the cache gets loaded. But infer->deserializeCudaEngine now takes 24.5 seconds.

This is extremely confusing. Why does deserializeCudaEngine take so little time when it is called right after buildCudaEngine?
For my example it doesn’t matter whether I rebuild the model at every startup or not.
FP16 is not used…

For reference, I also compiled the example on my TX2, added time measurements, and disabled FP16.
The call builder->buildCudaEngine(*network) takes 37.3 seconds.
After that the call infer->deserializeCudaEngine takes 0.0851 seconds.
Now if I rerun the program the cache gets loaded. But infer->deserializeCudaEngine now takes 16.4 seconds.

The resulting bvlc_googlenet.caffemodel.2.tensorcache generated during the first run is 27 MB.
My own cache is only 41 KB, as the net is much smaller. I’m confused why my net still takes so much time to load.
Is there a way to debug or profile this?
And why is deserializeCudaEngine right after buildCudaEngine so incredibly fast? Is there an in-memory cache that isn’t visible in the program code?

I can’t find much information about this on this board, so maybe others don’t have the same issue or aren’t as picky as I am. I only created this topic because I didn’t understand why loading from the cache is just as slow as building.

We are suffering from the same problem here.

When we use trtexec with the --loadEngine option to load the serialized model, loading is pretty fast.

However, when we load the engine using the same loadEngine function from trtexec, copied into our code base, loading is extremely slow, around 10 minutes.

The method deserializeCudaEngine does not print out any information for further debugging.

Our configuration:
TensorRT 7.2.2.3, CUDA 11.2.0; our GPU model is an A6000.

In contrast, if the same code is run on a different GPU, a 1080 Ti, loading does not take long at all.

Here is the loading code; the bottleneck is the call to deserializeCudaEngine.

std::shared_ptr<nvinfer1::ICudaEngine> loadEngine(const std::string &enginePath, int DLACore)
{
    SampleErrorRecorder errRecorder;

    // Read the serialized engine file into memory.
    std::ifstream engineFile(enginePath, std::ios::binary);
    if (!engineFile)
    {
        qCritical() << "Error opening engine file: " << QString::fromStdString(enginePath);
        return nullptr;
    }

    engineFile.seekg(0, std::ifstream::end);
    long int fsize = engineFile.tellg();
    engineFile.seekg(0, std::ifstream::beg);

    std::vector<char> engineData(fsize);
    engineFile.read(engineData.data(), fsize);
    if (!engineFile)
    {
        qCritical() << "Error loading engine file: " << QString::fromStdString(enginePath);
        return nullptr;
    }

    SampleUniquePtr<IRuntime> runtime{createInferRuntime(gLogger.getTRTLogger())};
    if (!runtime)
    {
        qCritical() << "Failed to create TensorRT runtime";
        return nullptr;
    }
    if (DLACore != -1)
    {
        runtime->setDLACore(DLACore);
    }

    runtime->setErrorRecorder(&errRecorder);

    // This call is the bottleneck: it takes around 10 minutes on the A6000.
    auto engine = runtime->deserializeCudaEngine(engineData.data(), fsize, nullptr);
    if (!engine)
    {
        qCritical() << "Failed to deserialize engine: " << QString::fromStdString(enginePath);
        return nullptr;
    }

    return std::shared_ptr<nvinfer1::ICudaEngine>(engine, trt_common::InferDeleter());
}
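
For reference, this is roughly how we isolate the loading time (an illustrative timing harness around the loadEngine above, not our production code; the engine path is a placeholder):

#include <chrono>
#include <iostream>

int main()
{
    const auto t0 = std::chrono::steady_clock::now();
    auto engine = loadEngine("model.engine", /*DLACore=*/-1);   // placeholder path
    const auto t1 = std::chrono::steady_clock::now();

    // Almost all of this time is spent inside deserializeCudaEngine.
    std::cout << "loadEngine took "
              << std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count()
              << " ms" << std::endl;
    return engine ? 0 : 1;
}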

Hi,
Please share the model, script, profiler, and performance output (if not shared already) so that we can help you better.
Alternatively, you can try running your model with the trtexec command:
https://github.com/NVIDIA/TensorRT/tree/master/samples/opensource/trtexec

While measuring the model performance, make sure you consider the latency and throughput of the network inference only, excluding the data pre- and post-processing overhead (a sketch follows the links below).
Please refer to the links below for more details:
https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-722/best-practices/index.html#measure-performance
https://docs.nvidia.com/deeplearning/tensorrt/best-practices/index.html#model-accuracy
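
As a rough illustration of measuring only the network inference (a hedged sketch, not from the TensorRT samples; it assumes an execution context, device bindings, and a CUDA stream have already been set up, and uses the TensorRT 7 enqueueV2 API):

#include <NvInfer.h>
#include <cuda_runtime_api.h>

#include <chrono>

// bindings: device pointers matching the engine's binding indices.
// Pre-processing (filling the input buffer) and post-processing are kept
// outside the timed region.
double timeInferenceMs(nvinfer1::IExecutionContext* context, void** bindings, cudaStream_t stream)
{
    const auto t0 = std::chrono::steady_clock::now();
    context->enqueueV2(bindings, stream, nullptr);   // launch inference only
    cudaStreamSynchronize(stream);                   // wait for the GPU to finish
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}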

Thanks!

hi,
I’m facing this problem with the same configuration.
Deserializing the engine takes much longer on an RTX 3060, RTX 3090, or A2000 (in TCC mode) than on a GTX 1080, RTX 2080 Ti, or RTX Titan. This causes a lot of trouble for our real-time system.