Hi, I’m building an SDK that uses multiple TensorRT engines. When each model is tested on its own, its inference time is close to the mean time I see with trtexec --loadEngine=<model.engine> --iterations=100. But when run inside the SDK, all the models perform worse (sometimes by as much as 40%!).
In the SDK, I do the ‘init’ for all the models together (basically loading each engine and creating its execution context). After that, I call inference on whichever engine I need. I have 4 models loaded.
Am I doing something wrong or is this the expected behaviour? Is there a better way to do it?
Environment
TensorRT Version: 7.1.3
GPU Type: Jetson NX
CUDA Version: 10.2
Operating System + Version: Ubuntu 18.04 LTS
Python Version (if applicable): 3.6.9
The way I implement the inference code is very similar to the ONNXMNIST sample; only the build function is modified to also create the execution context. The build functions for all the models are called in the SDK’s INIT(), and the infer function is called when required, roughly as sketched below.
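For context, the structure looks roughly like this (a simplified sketch with placeholder names and paths, not the actual SDK code):

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)


class TrtModel:
    """One model: build() deserializes the engine and also creates the execution context."""

    def __init__(self, engine_path):
        self.engine_path = engine_path
        self.engine = None
        self.context = None

    def build(self):
        # Load the pre-built engine and create its context once, at init time.
        with open(self.engine_path, "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
            self.engine = runtime.deserialize_cuda_engine(f.read())
        self.context = self.engine.create_execution_context()

    def infer(self, bindings, stream_handle):
        # Explicit-batch execution (TensorRT 7.x API).
        self.context.execute_async_v2(bindings=bindings, stream_handle=stream_handle)


def init(engine_paths):
    """Called once by the SDK: build all models together."""
    models = {}
    for name, path in engine_paths.items():
        model = TrtModel(path)
        model.build()
        models[name] = model
    return models


# Hypothetical usage:
# models = init({"model_a": "a.engine", "model_b": "b.engine"})
# models["model_a"].infer(bindings_a, stream_a.handle)
```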
If there is a better way to use the engines for different models simultaneously, please do tell.
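In case it helps, the per-model I/O buffers are set up roughly like this (again a sketch, based on the allocate_buffers helper used in the TensorRT Python samples, and assuming static input shapes):

```python
import pycuda.driver as cuda
import pycuda.autoinit  # noqa: F401  (creates a CUDA context)
import tensorrt as trt


def allocate_buffers(engine):
    """Allocate pinned host and device buffers for every binding of an engine, once at init."""
    host_bufs, dev_bufs, bindings = [], [], []
    for i in range(engine.num_bindings):
        size = trt.volume(engine.get_binding_shape(i))
        dtype = trt.nptype(engine.get_binding_dtype(i))
        host_mem = cuda.pagelocked_empty(size, dtype)  # pinned host memory
        dev_mem = cuda.mem_alloc(host_mem.nbytes)      # device memory
        host_bufs.append(host_mem)
        dev_bufs.append(dev_mem)
        bindings.append(int(dev_mem))
    # A dedicated stream per model, so the four models do not serialize on one stream.
    stream = cuda.Stream()
    return host_bufs, dev_bufs, bindings, stream
```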
We recommend that you try the latest TensorRT version. If you still face the performance issue, please share the ONNX models and a script/steps to reproduce the issue so we can try it on our end and help better.