Load TensorRT engine and deserialize in C++

Where can I find a C++ sample that loads a TensorRT engine and deserializes it for inference?

The example below is in Python; I'm looking for the C++ version.

  1. Serialize the engine and write it to a file:

with open("sample.engine", "wb") as f: f.write(engine.serialize())

  2. Read the engine from the file and deserialize:

with open("sample.engine", "rb") as f, trt.Runtime(TRT_LOGGER) as runtime: engine = runtime.deserialize_cuda_engine(f.read())

Hi @edit_or,

Kindly refer to the below links


That sample loads a UFF model and builds the engine.
I would like to load a TensorRT engine file (detect.engine) directly in C++.
The engine was built on the same system, so I don't need to rebuild it; I can use the engine file directly.
How do I load and deserialize it in C++?

Hi @edit_or
You can use the trtexec command to load the engine:
trtexec --loadEngine=g1.trt --batch=1


Hello AakankshaS,

I don't think this really answers the problem we have here.

I am facing the same issue as edit_or: the point is about loading a .engine model in C++ (loading from the terminal works).
I couldn't find an answer, only many suggestions such as working from the DeepStream sample (nvdsinfer_custom_impl_yolo) and modifying its source. Loading an engine into a custom C++ app still remains very mysterious.

I am not sure what the .engine file is really meant for, and the documentation seems poor on this topic. Does your answer mean we are supposed to include, and eventually modify, the trtexec source in our app to run a model? Isn't there some sort of C++ interface/library for doing so?

Thanks for your attention.

It seems amazing to me that Nvidia is always trying to empower developers to do things with its libraries and frameworks, yet it is practically impossible to find out how to load an .engine model and perform inference with it in simple, plain C++ code.

This is all they show (barely 7 lines, out of context). So you have to dive into Dustin Franklin's jetson-inference code for hours just to understand a little of how this works.

Could you please show a simple, explained example of how to work with .engine files to perform inference?

I think this would help us build better solutions using Nvidia frameworks. :)
Kind regards.

We request you to share the model, script, profiler and performance output, if not shared already, so that we can help you better.
Alternatively, you can try running your model with the trtexec command.

While measuring model performance, make sure you consider only the latency and throughput of the network inference, excluding the data pre- and post-processing overhead.
Please refer to the link below for more details:


@matesanz.cuadrado, have you found a solution?
I am also facing the same problem.

No, unfortunately the docs are still far from what one would expect from a serious product.

If you figure it out, please, let me know.

Is this GitHub repository helpful for you?

Hello, here is the code that got it working for me:

// main.cpp
#include <fstream>
#include <memory>
#include <sstream>
#include <string>
#include <NvInfer.h>
#include <NvInferPlugin.h>
#include <NvInferPluginUtils.h>
#include <NvInferRuntime.h>
#include <NvInferRuntimeCommon.h>

#include "logger.h"

// Deleter so that std::unique_ptr releases TensorRT objects via destroy().
// destroy() is the older release call; on TensorRT 8+ plain delete is used instead.
struct TRTDestroy {
    template<class T>
    void operator()(T* obj) const {
        if (obj) obj->destroy();
    }
};

template<class T>
using TRTUniquePtr = std::unique_ptr<T, TRTDestroy>;

// The statements below go inside main() or your own init function.

// Read the whole serialized engine (plan) into memory, in binary mode.
std::ifstream planFile("<path_to_engine_file>", std::ios::binary);
std::stringstream planBuffer;
planBuffer << planFile.rdbuf();
std::string plan = planBuffer.str();

// The runtime must exist before the engine can be deserialized.
TRTUniquePtr<nvinfer1::IRuntime> runtime{nvinfer1::createInferRuntime(gLogger)};
TRTUniquePtr<nvinfer1::ICudaEngine> engine{nullptr};
TRTUniquePtr<nvinfer1::IExecutionContext> context{nullptr};

// The third argument (plugin factory) only exists in older TensorRT releases.
engine.reset(runtime->deserializeCudaEngine(plan.data(), plan.size(), nullptr));
context.reset(engine->createExecutionContext());

// logger.h
#pragma once

#include <iostream>
#include <NvInfer.h>

class Logger : public nvinfer1::ILogger {
public:
    // TensorRT 8+ declares log() as noexcept; add noexcept there.
    void log(Severity severity, const char* msg) override {
        if (severity != Severity::kINFO) {
            std::cout << msg << std::endl;
        }
    }
} gLogger;

// I originally had the logger defined twice (in the main file and in logger.h).
// One definition should be enough, but I did not experiment without the duplicate.
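The file read above has no error handling, so a wrong path silently produces an empty plan and a failed deserialization. A minimal sketch of a stricter read is shown here, using only the standard library; `readEngineFile` is a hypothetical helper name, not a TensorRT function.

```cpp
#include <cstddef>
#include <fstream>
#include <ios>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not part of TensorRT): read a whole binary file into
// memory and throw if the file is missing, empty, or cannot be read fully.
std::vector<char> readEngineFile(const std::string& path) {
    // Open at the end so tellg() reports the file size.
    std::ifstream file(path, std::ios::binary | std::ios::ate);
    if (!file) {
        throw std::runtime_error("cannot open engine file: " + path);
    }
    const std::streamsize size = static_cast<std::streamsize>(file.tellg());
    if (size <= 0) {
        throw std::runtime_error("engine file is empty or unreadable: " + path);
    }
    std::vector<char> blob(static_cast<std::size_t>(size));
    file.seekg(0, std::ios::beg);
    if (!file.read(blob.data(), size)) {
        throw std::runtime_error("failed to read engine file: " + path);
    }
    return blob;
}
```

The returned buffer's `data()` and `size()` are the two values that would be handed to `deserializeCudaEngine` in place of `plan.data()` and `plan.size()`.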

From there you can start populating your input (cudaMalloc / cudaMemcpy) and store the device pointers in a std::vector<void*>, for example.

Then request execution with, for example:

context->enqueue(batch_size, buffers.data(), cuda_stream /* 0, for example */, nullptr);

where buffers is the std::vector<void*> holding pointers to the input and output buffers in GPU memory (from cudaMalloc).
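Each cudaMalloc call needs a size in bytes. The sketch below shows one way to compute it from a tensor's dimensions; `bindingBytes` is a hypothetical helper, not a TensorRT or CUDA function. It assumes an implicit-batch engine (the kind that `enqueue(batch_size, ...)` is meant for), where the reported dimensions exclude the batch dimension, and that you pass the element size yourself (4 for float32).

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical helper: number of bytes to allocate for one binding.
// `dims` are the per-sample dimensions (batch excluded), `elementSize` is the
// size of one element in bytes, e.g. sizeof(float).
std::size_t bindingBytes(const std::vector<std::int64_t>& dims,
                         std::size_t batchSize,
                         std::size_t elementSize) {
    std::size_t count = batchSize;
    for (std::int64_t d : dims) {
        // Non-positive values (such as -1 for a dynamic dimension) cannot be sized.
        if (d <= 0) {
            throw std::invalid_argument("dimension must be positive");
        }
        count *= static_cast<std::size_t>(d);
    }
    return count * elementSize;
}
```

For a 3x224x224 float input with a batch of 1, `bindingBytes({3, 224, 224}, 1, sizeof(float))` gives 602112 bytes, which is the size you would pass to cudaMalloc before pushing the resulting pointer into the buffers vector.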

I can't remember where I found this, so sorry for the missing credit.

I may detail the second part a bit more later.
Good luck.

Nice, that helps.