Performance discrepancy using TensorRT engines


Hi, I’m building an SDK that uses multiple engines. When each model is tested alone, its inference time is close to the mean time I see using trtexec --loadEngine=<model.engine> --iterations=100. But when run inside the SDK, every model performs worse, sometimes by as much as 40%.
In the SDK, I run the ‘init’ for all the models together (loading each engine and creating its execution context). After that, I call inference on whichever engine I need. I have 4 models loaded.

Am I doing something wrong or is this the expected behaviour? Is there a better way to do it?


TensorRT Version: 7.1.3
GPU Type: Jetson NX
CUDA Version: 10.2
Operating System + Version: Ubuntu 18.04 LTS
Python Version (if applicable): 3.6.9

Please share the model, script, profiler output, and performance output, if you have not already, so that we can help you better.
Alternatively, you can try running your model with the trtexec command.

While measuring model performance, make sure you consider only the latency and throughput of the network inference, excluding the data pre- and post-processing overhead.
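As a rough illustration of that advice, here is a minimal C++ sketch that times only the inference call, after a few warm-up runs. The names `measureLatency` and `LatencyStats` are hypothetical, and `infer` is a stand-in for whatever synchronous inference call the SDK makes (it is assumed to return only once the GPU work has finished); none of this comes from the TensorRT API.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <functional>
#include <numeric>
#include <vector>

// Hypothetical result type: all values are in milliseconds.
struct LatencyStats {
    double mean_ms;
    double median_ms;
    double max_ms;
};

// Times only the callable passed in. Keep pre- and post-processing outside
// of `infer` so they do not inflate the numbers.
// Assumption: `infer` blocks until the inference has completed.
LatencyStats measureLatency(const std::function<void()>& infer,
                            std::size_t warmup, std::size_t iterations)
{
    // Warm-up runs are executed but not recorded.
    for (std::size_t i = 0; i < warmup; ++i) {
        infer();
    }

    std::vector<double> samples;
    samples.reserve(iterations);
    for (std::size_t i = 0; i < iterations; ++i) {
        const auto start = std::chrono::steady_clock::now();
        infer();
        const auto stop = std::chrono::steady_clock::now();
        samples.push_back(
            std::chrono::duration<double, std::milli>(stop - start).count());
    }

    if (samples.empty()) {
        return {0.0, 0.0, 0.0};
    }

    std::sort(samples.begin(), samples.end());
    const double sum = std::accumulate(samples.begin(), samples.end(), 0.0);
    return {sum / static_cast<double>(samples.size()),
            samples[samples.size() / 2],
            samples.back()};
}
```

Wrapping the same inference call in `measureLatency` both in a standalone test and inside the SDK would give numbers that can be compared like for like with the trtexec mean.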
Please refer to the link below for more details:


My inference code is very similar to the ONNXMNIST sample. Only the build function is modified so that it also creates the context. The build functions for all the models are called in the SDK’s INIT(), and the infer function is called when required.
If there is a better way to use the engines for different models simultaneously, please let me know.

bool SampleInference::build()
{
    std::vector<char> trtModelStream_;
    size_t size{ 0 };

    std::ifstream file("/media/31A079936F39FBF9/romil/onnx_cache_trt/midas_384_new_folded_questionmark.trt", std::ios::binary);

    if (file.good())
    {
        // Read the whole serialized engine into memory.
        file.seekg(0, file.end);
        size = file.tellg();
        file.seekg(0, file.beg);
        trtModelStream_.resize(size);
        file.read(trtModelStream_.data(), size);
    }

    // Deserialize the engine from the in-memory buffer.
    IRuntime* runtime = createInferRuntime(sample::gLogger);
    mEngine_hq = std::shared_ptr<nvinfer1::ICudaEngine>(runtime->deserializeCudaEngine(trtModelStream_.data(), size, nullptr), samplesCommon::InferDeleter());
    if (!mEngine_hq)
    {
        return false;
    }

    // Create the execution context once, here, so infer() can reuse it.
    context_iExecutionContext = (mEngine_hq->createExecutionContext());
    context_hq = SampleUniquePtr<nvinfer1::IExecutionContext>(context_iExecutionContext);
    nvinfer1::Dims4 input_dimensions(BATCH,3,384,1120);
    //int binding_index = nvinfer1::iCudaEngine::getBindingIndex("INPUTS");
    return true;
}
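The build function above carries on to deserialization even if the engine file could not be opened or read. A minimal sketch of the file-loading step with explicit failure checks might look like the following; `readEngineFile` is a hypothetical helper name, not part of TensorRT or its samples, and it uses only the standard library.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Reads a whole binary file into memory.
// Returns an empty vector if the file cannot be opened, is empty,
// or cannot be read completely.
std::vector<char> readEngineFile(const std::string& path)
{
    // Open at the end of the file so tellg() reports the file size.
    std::ifstream file(path, std::ios::binary | std::ios::ate);
    if (!file) {
        return {};
    }

    const std::streamsize size = file.tellg();
    if (size <= 0) {
        return {};
    }

    file.seekg(0, std::ios::beg);
    std::vector<char> buffer(static_cast<std::size_t>(size));
    if (!file.read(buffer.data(), size)) {
        return {};
    }
    return buffer;
}
```

With a helper like this, build() could return false as soon as the buffer comes back empty, and otherwise pass `buffer.data()` and `buffer.size()` to the deserialization call.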


We recommend that you try the latest TensorRT version. If you still face the performance issue, please share an ONNX model that reproduces it, along with the script or steps, so that we can try it on our end and help you better.

Thank you.