TensorRT Concurrent inference in C++

Description

My C++ application concurrently executes model inference over N video streams on a single GPU. The model is exactly the same for all video streams (a single model.trt file). I have a few questions about concurrent inference execution.

In order to execute a TensorRT model, we need to create and initialize an IRuntime, an ICudaEngine and an IExecutionContext. We have two options:

Solution 1: create one triplet (runtime, engine, context) independently for each video stream, and process the video streams independently. The engine is deserialized N times.
Solution 2: create an IRuntime and an ICudaEngine only once, and then create N IExecutionContext instances, one for each stream. The engine is deserialized only once (see the sketch after this list).
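For reference, here is a minimal sketch of the Solution 2 setup, assuming the TensorRT 8.5 C++ API; the file name model.trt is from our project, while the stream count and error handling are simplified placeholders:

#include <NvInfer.h>
#include <fstream>
#include <iostream>
#include <memory>
#include <vector>

// Minimal logger required by the TensorRT runtime API.
class Logger : public nvinfer1::ILogger
{
    void log(Severity severity, const char* msg) noexcept override
    {
        if (severity <= Severity::kWARNING)
            std::cout << msg << std::endl;
    }
};

int main()
{
    constexpr int kNumStreams = 4; // N video streams (example value)

    // Read the serialized engine from disk.
    std::ifstream file("model.trt", std::ios::binary | std::ios::ate);
    const std::streamsize size = file.tellg();
    file.seekg(0, std::ios::beg);
    std::vector<char> blob(static_cast<size_t>(size));
    file.read(blob.data(), size);

    Logger logger;

    // One runtime and one engine, deserialized a single time.
    std::unique_ptr<nvinfer1::IRuntime> runtime{nvinfer1::createInferRuntime(logger)};
    std::unique_ptr<nvinfer1::ICudaEngine> engine{
        runtime->deserializeCudaEngine(blob.data(), blob.size())};

    // N execution contexts, one per video stream.
    std::vector<std::unique_ptr<nvinfer1::IExecutionContext>> contexts;
    for (int i = 0; i < kNumStreams; ++i)
    {
        contexts.emplace_back(engine->createExecutionContext());
    }

    // ... each context is then driven by its own thread / CUDA stream ...
    return 0;
}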

In solution 1, the TensorRT engine (e.g., model.trt) is deserialized several times. I observed that when more engines of the same model were deserialized, GPU memory usage increased, but not at all linearly. For instance, deserializing one engine consumed about 100 MB, but deserializing a second one only increased GPU memory by another 30 MB (observed via nvidia-smi). Is this phenomenon expected? Is there perhaps some internal allocation mechanism that efficiently handles repeated engine deserialization?

In solution 2, we create multiple execution contexts from one deserialized engine. Could you please provide some insight into such a setup?
For instance, what are the benefits / downsides of a 1-engine-N-contexts setup compared to an N-engines-N-contexts setup, in terms of memory consumption and speed?

May I call the “execute” or “enqueue” methods of the different contexts, created from the same engine, in parallel without any synchronization (mutex)?
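To make the question concrete, this is roughly the per-stream worker I have in mind (a minimal sketch assuming enqueueV2 is used; the device buffers are allocated elsewhere, and the binding order and loop condition are placeholders that depend on the actual model and pipeline):

#include <NvInfer.h>
#include <cuda_runtime_api.h>
#include <thread>
#include <vector>

// One worker per video stream: each thread owns one execution context,
// one CUDA stream and its own device buffers, and calls enqueueV2()
// without any mutex.
void inferWorker(nvinfer1::IExecutionContext* context, void* deviceInput, void* deviceOutput)
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    void* bindings[] = {deviceInput, deviceOutput}; // binding order depends on the model

    bool framesRemain = true; // placeholder for the real per-stream loop condition
    while (framesRemain)
    {
        // ... cudaMemcpyAsync the preprocessed frame into deviceInput ...
        context->enqueueV2(bindings, stream, nullptr);
        // ... cudaMemcpyAsync the results out of deviceOutput ...
        cudaStreamSynchronize(stream);
        framesRemain = false; // placeholder so the sketch terminates
    }

    cudaStreamDestroy(stream);
}

// Launching one worker per context (contexts, inputs and outputs come from the setup above):
// std::vector<std::thread> workers;
// for (size_t i = 0; i < contexts.size(); ++i)
//     workers.emplace_back(inferWorker, contexts[i].get(), inputs[i], outputs[i]);
// for (auto& t : workers) t.join();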

In addition, is there any limit on the number of execution contexts that can be created from a single engine?

Furthermore, is this the correct / advised / optimal solution for concurrent inference execution of the same model?

Finally, we are aware of Triton Inference Server and DeepStream; however, we wonder whether handling multiple streams concurrently can be achieved with just TensorRT’s C++ API.

Environment

TensorRT Version: 8.5.2.2
GPU Type: GTX 1080
Nvidia Driver Version:
CUDA Version: 11.1
CUDNN Version: 8.2.1
Operating System + Version: Ubuntu 20.04
Python Version (if applicable):
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag):

Relevant Files

Provide upon request

Hi,

Just wanted to follow up; I hope to hear your feedback on the subject. Thank you in advance!

Hi @alphadadajuju2,
(1) Repeated initialization of TRT engines costs the same amount of memory each time. However, there is work going on in CUDA when a library is loaded (i.e., the device code is loaded onto the GPU) which happens only once, typically on the first call into the library, and this is what you may be observing.
(2) I am afraid we don't have a sample. However, this is how TRT is designed to be used: single engine, multiple contexts. There is no performance downside to this, and usually a considerable memory upside. There is no built-in limit on the number of execution contexts for an engine. No synchronization is required between execution contexts. See the developer guide for (slightly) more detail.
Thanks

Thank you for the explanations on our two queries!
