TensorRT caching mechanism is not very fast; deserializeCudaEngine takes a long time


Currently I’m testing TensorRT 2.1 on the TX2, but I’m not happy with the loading times.

I’ve taken https://github.com/dusty-nv/jetson-inference/blob/master/tensorNet.cpp as an example and based my code on it.
Here the developer creates a cache which is loaded if it exists. If not it is created by profiling.
My guess is this is done to save startup time of the program after the first run.
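
The load-if-present, build-otherwise pattern described above can be sketched roughly as follows. This is only a minimal illustration, not the jetson-inference code: loadOrBuildCache, buildAndSerialize and loadedFromCache are hypothetical names, and the callback stands in for the slow buildCudaEngine plus serialization path.

```cpp
#include <fstream>
#include <functional>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical helper: returns the serialized engine bytes. If cachePath can
// be opened, its contents are returned; otherwise buildAndSerialize() (a
// stand-in for the slow builder path) is called and its result is written to
// cachePath for the next run.
std::vector<char> loadOrBuildCache(const std::string& cachePath,
                                   const std::function<std::vector<char>()>& buildAndSerialize,
                                   bool* loadedFromCache = nullptr)
{
    std::ifstream in(cachePath, std::ios::binary);
    if (in)
    {
        // Cache hit: read the whole file into memory.
        std::vector<char> data((std::istreambuf_iterator<char>(in)),
                               std::istreambuf_iterator<char>());
        if (loadedFromCache) *loadedFromCache = true;
        return data;
    }

    // Cache miss: build once, then persist the bytes.
    std::vector<char> data = buildAndSerialize();
    std::ofstream out(cachePath, std::ios::binary);
    out.write(data.data(), static_cast<std::streamsize>(data.size()));
    if (loadedFromCache) *loadedFromCache = false;
    return data;
}
```

In both cases the returned bytes would still have to go through deserializeCudaEngine, which is the step whose cost is in question here.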

I’ve measured some of the time-consuming functions; here are the results for my own net.

The call builder->buildCudaEngine(*network) takes 24.7 seconds.
After that the call infer->deserializeCudaEngine takes 0.003 seconds.

Now if I rerun the program the cache gets loaded. But infer->deserializeCudaEngine now takes 24.5 seconds.

This is extremely confusing. Why does deserializeCudaEngine take so little time when it is called right after buildCudaEngine?
For my example it doesn’t matter whether I build the model at every startup or load the cache.
FP16 is not used…

For reference, I also compiled the example on my TX2, added time measurements, and disabled FP16.
The call builder->buildCudaEngine(*network) takes 37.3 seconds.
After that the call infer->deserializeCudaEngine takes 0.0851 seconds.
Now if I rerun the program the cache gets loaded. But infer->deserializeCudaEngine now takes 16.4 seconds.

The resulting bvlc_googlenet.caffemodel.2.tensorcache generated during the first run is 27 MB.
My own cache is only 41 KB, as the net is much smaller. I’m confused why my net takes that much time to load.
Is there a way to debug or profile this?
And why is deserializeCudaEngine so incredibly fast right after buildCudaEngine? Is there an in-memory cache that I can’t see in the program code?

I can’t find much information about this on this board. So maybe others don’t have the same issue or aren’t as picky as I am. I only created this topic because I don’t understand why loading from the cache is just as slow as building.

We are suffering from the same problem here.

When we use trtexec with the --loadEngine option to load the serialized model, loading is pretty fast.

However, when we load the engine using the same loadEngine function from trtexec but inside our own code base, loading is extremely slow: around 10 minutes.

The method deserializeCudaEngine does not print out any information for further debugging.

Our configuration:
TensorRT, CUDA 11.2.0, GPU model A6000.

In contrast, if the same code is run on a different GPU, a 1080 Ti, loading is pretty fast.

Here is the loading code. The bottleneck is the call to deserializeCudaEngine.

std::shared_ptr<nvinfer1::ICudaEngine> loadEngine(const std::string &enginePath, int DLACore)
{
    SampleErrorRecorder errRecorder;

    // Open the serialized engine in binary mode.
    std::ifstream engineFile(enginePath, std::ios::binary);
    if (!engineFile)
    {
        qCritical() << "Error opening engine file: " << QString::fromStdString(enginePath);
        return nullptr;
    }

    // Determine the file size, then read the whole file into memory.
    engineFile.seekg(0, std::ifstream::end);
    long int fsize = engineFile.tellg();
    engineFile.seekg(0, std::ifstream::beg);

    std::vector<char> engineData(fsize);
    engineFile.read(engineData.data(), fsize);
    if (!engineFile)
    {
        qCritical() << "Error loading engine file: " << QString::fromStdString(enginePath);
        return nullptr;
    }

    SampleUniquePtr<IRuntime> runtime{createInferRuntime(gLogger.getTRTLogger())};
    if (DLACore != -1)
    {
        // The body of this branch is missing from the post; presumably it
        // selects the DLA core, as trtexec's loadEngine does.
        runtime->setDLACore(DLACore);
    }

    // This call is the bottleneck.
    auto engine = runtime->deserializeCudaEngine(engineData.data(), fsize, nullptr);

    return std::shared_ptr<nvinfer1::ICudaEngine>(engine, trt_common::InferDeleter());
}

Please share the model, script, profiler, and performance output, if you have not already, so that we can help you better.
Alternatively, you can try running your model with the trtexec command.

When measuring model performance, make sure you consider the latency and throughput of the network inference only, excluding the data pre- and post-processing overhead.
Please refer to the link below for more details:


I’m facing this problem with the same configuration.
Deserializing the engine takes much longer on an RTX 3060, RTX 3090, or A2000 (in TCC mode) than on a GTX 1080, RTX 2080 Ti, or RTX Titan. This causes a lot of trouble for a real-time system.