Description
My C++ application runs model inference concurrently over N video streams on a single GPU. The model is exactly the same for all video streams (a single model.trt file). I have a few questions about concurrent inference execution.
In order to execute a TensorRT model, we need to create and initialize an IRuntime, an ICudaEngine and an IExecutionContext. We see two options:
Solution 1: create one (runtime, engine, context) triplet per video stream and process the streams independently. The engine is deserialized N times.
Solution 2: create a single IRuntime and a single ICudaEngine, then create N IExecutionContexts, one per stream. The engine is deserialized only once (sketched below).
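For concreteness, here is a minimal sketch of what we mean by solution 2, assuming the model file is model.trt and the stream count N is a placeholder (error handling omitted):

```cpp
#include <NvInfer.h>

#include <fstream>
#include <iostream>
#include <iterator>
#include <memory>
#include <vector>

class Logger : public nvinfer1::ILogger {
    void log(Severity severity, const char* msg) noexcept override {
        if (severity <= Severity::kWARNING) std::cerr << msg << std::endl;
    }
};

int main() {
    constexpr int N = 4;  // number of video streams (placeholder)

    // Read the serialized engine from disk.
    std::ifstream file("model.trt", std::ios::binary);
    std::vector<char> blob((std::istreambuf_iterator<char>(file)),
                           std::istreambuf_iterator<char>());

    Logger logger;
    std::unique_ptr<nvinfer1::IRuntime> runtime{
        nvinfer1::createInferRuntime(logger)};

    // Deserialize the engine only once...
    std::unique_ptr<nvinfer1::ICudaEngine> engine{
        runtime->deserializeCudaEngine(blob.data(), blob.size())};

    // ...then create one execution context per video stream.
    std::vector<std::unique_ptr<nvinfer1::IExecutionContext>> contexts;
    for (int i = 0; i < N; ++i)
        contexts.emplace_back(engine->createExecutionContext());
}
```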
In solution 1, the TensorRT engine (model.trt) is deserialized several times. I observed that GPU memory usage grew with each additional engine of the same model, but not linearly: deserializing the first engine consumed about 100 MB, while deserializing a second one added only about 30 MB (observed via nvidia-smi). Is this behavior expected? Is there some internal allocation mechanism that efficiently handles repeated deserialization of the same engine?
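For reference, the repeated deserialization in solution 1 corresponds to a pattern like the following (reusing the logger and blob from the sketch above):

```cpp
// Solution 1: one independent (runtime, engine, context) triplet per stream.
// The same model.trt blob is deserialized N times; nvidia-smi showed
// ~100 MB for the first engine but only ~30 MB more for the second.
std::vector<std::unique_ptr<nvinfer1::IRuntime>> runtimes;
std::vector<std::unique_ptr<nvinfer1::ICudaEngine>> engines;
std::vector<std::unique_ptr<nvinfer1::IExecutionContext>> contexts;
for (int i = 0; i < N; ++i) {
    runtimes.emplace_back(nvinfer1::createInferRuntime(logger));
    engines.emplace_back(
        runtimes.back()->deserializeCudaEngine(blob.data(), blob.size()));
    contexts.emplace_back(engines.back()->createExecutionContext());
}
```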
In solution 2, we create multiple execution contexts from a single deserialized engine. Could you provide some insight into such a setup?
For instance, what are the benefits and downsides of a 1-engine-N-contexts setup compared to N-engines-N-contexts, in terms of memory consumption and speed?
May I call the execute or enqueue methods of the different contexts created from the same engine in parallel, without any synchronization (mutex)? (See the sketch after these questions.)
In addition, is there any limit on the number of execution contexts that can be created from a single engine?
Furthermore, is this the correct/advised/optimal approach for concurrent inference execution of the same model?
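To make the concurrency question concrete, this is the usage pattern we have in mind: one thread per video stream, each owning its own context and CUDA stream, with no mutex around enqueue. The device buffers dInput/dOutput and the binding order are placeholders:

```cpp
#include <NvInfer.h>
#include <cuda_runtime_api.h>

// One worker per video stream. Each thread owns its own execution
// context and CUDA stream; nothing is shared except the engine.
void worker(nvinfer1::IExecutionContext* ctx, void* dInput, void* dOutput) {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Binding order must match the engine's binding indices (placeholder).
    void* bindings[] = {dInput, dOutput};

    // Enqueue inference asynchronously on this thread's CUDA stream.
    // Is this safe to call concurrently from N threads without a mutex?
    ctx->enqueueV2(bindings, stream, nullptr);
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
}

// Intended launch (contexts and device buffers allocated elsewhere):
//   std::vector<std::thread> threads;
//   for (int i = 0; i < N; ++i)
//       threads.emplace_back(worker, contexts[i].get(), dIn[i], dOut[i]);
//   for (auto& t : threads) t.join();
```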
Finally, we are aware of Triton Server and DeepStream; however, we wonder whether handling multiple streams concurrently can be achieved with TensorRT's C++ API alone.
Environment
TensorRT Version: 8.5.2.2
GPU Type: GTX 1080
Nvidia Driver Version:
CUDA Version: 11.1
CUDNN Version: 8.2.1
Operating System + Version: Ubuntu 20.04
Python Version (if applicable):
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag):
Relevant Files
Provide upon request