Loading a new plan file while running inference



My question is: is it possible to load a new plan file onto the GPU (without executing it yet) while continuously running inference?

I’m currently using the trtexec sample to execute plan files. Say I have two plan files, one for Inception and one for MobileNet. I want to load Inception and run multiple inferences, and in the meantime load the MobileNet plan file to hide its loading latency, so that MobileNet is already loaded and can be used right away.


  1. Is it possible to load a new model while another one is busy running inferences, without interrupting it?
  2. If yes, can we hide the latency of loading the new model? Can we load it in a way that the inference latency of the old model is not impacted?
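To make the intended overlap concrete, here is a minimal sketch of the prefetch pattern using a background thread. The `load_engine` and `run_inference` functions are hypothetical stand-ins for the real TensorRT calls (deserializing a plan with `Runtime.deserialize_cuda_engine` and executing with an execution context); whether the deserialization actually proceeds without slowing down the running model is device-dependent and is exactly what the question asks.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the real TensorRT calls
# (runtime.deserialize_cuda_engine / context execution).
def load_engine(plan_path):
    """Simulate deserializing a plan file into an engine."""
    time.sleep(0.05)          # stands in for deserialization latency
    return f"engine:{plan_path}"

def run_inference(engine, batch):
    """Simulate one inference call on an already-loaded engine."""
    return f"{engine} -> result({batch})"

executor = ThreadPoolExecutor(max_workers=1)

# 1. Load the first engine up front and start inferring with it.
inception = load_engine("inception.plan")

# 2. Kick off deserialization of the next engine in the background
#    while Inception keeps serving requests.
mobilenet_future = executor.submit(load_engine, "mobilenet.plan")

results = [run_inference(inception, b) for b in range(4)]

# 3. By the time we switch models, the load is (hopefully) finished;
#    .result() only blocks for whatever latency was not hidden.
mobilenet = mobilenet_future.result()
results.append(run_inference(mobilenet, 0))
```

Note that on a shared-memory Jetson device the background load still contends with inference for CPU, memory bandwidth, and possibly GPU resources, so some impact on the running model's latency cannot be ruled out in general.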



TensorRT Version : 7.1
GPU Type : 512-Core Volta GPU with Tensor Cores
Nvidia Driver Version :
CUDA Version : 10.2
CUDNN Version : 8.0
Operating System + Version : Jetpack 4.4
Python Version (if applicable) : 3.6
TensorFlow Version (if applicable) :
PyTorch Version (if applicable) :
Baremetal or Container (if container which image + tag) :


In order to run multiple models with TensorRT, I would recommend using either NVIDIA DeepStream or the NVIDIA Triton Inference Server.
Please refer to the link below for more details:


If you want to perform multi-threading with TensorRT, please refer to the link below for best practices:
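As a sketch of the usual multi-threading best practice (one deserialized engine shared across threads, but a separate execution context created per thread, since execution contexts are not safe for concurrent use), here is a minimal illustration with a hypothetical stand-in for the engine object:

```python
import threading

# Hypothetical stand-in for a deserialized TensorRT engine. In the real
# API one ICudaEngine can be shared between threads, but each thread
# must create and use its own IExecutionContext.
class FakeEngine:
    def create_execution_context(self):
        # stands in for engine.create_execution_context()
        return {"engine_id": id(self)}

def worker(engine, results, idx):
    # Per-thread context: never share one context across threads.
    context = engine.create_execution_context()
    # stands in for launching inference through this thread's context
    results[idx] = ("ok", context["engine_id"])

engine = FakeEngine()            # shared, loaded once
results = [None] * 4
threads = [threading.Thread(target=worker, args=(engine, results, i))
           for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The design point is that engine deserialization (the expensive step) happens once, while the cheap per-thread contexts isolate the mutable execution state.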