Description
I’m running AI inference on video frames in C++, using TensorRT engines built from ONNX models on an RTX 4090. To speed this up, I’m trying to run inference in parallel on a single GPU using several CPU threads. Each CPU thread creates its own brand-new TensorRT execution context, which enqueues work on its own separate CUDA stream.
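For reference, each worker thread does roughly the following (minimal sketch for a one-input/one-output engine; `workerThread`, the byte sizes and the frame count are placeholder names, and the H2D/D2H copies and error checking are omitted):

```cpp
#include <NvInfer.h>
#include <cuda_runtime.h>

// Simplified per-thread inference loop. The ICudaEngine is deserialized once
// and shared by all threads; everything else is created per thread.
void workerThread(nvinfer1::ICudaEngine* engine,
                  size_t inputBytes, size_t outputBytes, int numFrames)
{
    // Each thread owns its own execution context and its own CUDA stream.
    nvinfer1::IExecutionContext* context = engine->createExecutionContext();

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    void* bindings[2];                    // [0] = input, [1] = output
    cudaMalloc(&bindings[0], inputBytes);
    cudaMalloc(&bindings[1], outputBytes);

    for (int i = 0; i < numFrames; ++i)
    {
        // (frame upload via cudaMemcpyAsync on `stream` omitted)
        context->enqueueV2(bindings, stream, nullptr);  // async launch on this thread's stream
        // (result download via cudaMemcpyAsync on `stream` omitted)
        cudaStreamSynchronize(stream);                  // wait only for this thread's frame
    }

    cudaFree(bindings[0]);
    cudaFree(bindings[1]);
    cudaStreamDestroy(stream);
    delete context;
}
```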
With, say, 3 CPU threads I can see that GPU utilization roughly triples, as it should. Here is the problem: there is no FPS increase. Throughput is exactly the same as if I were running just one single TensorRT context.
But if I add another two 4090 devices and do the same thing, i.e. 3 TensorRT contexts in parallel on 3 different devices, then everything is fine: FPS triples and it all works as intended.
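For comparison, the only change in the multi-GPU version is that each thread pins itself to its own device and deserializes its own engine there before running the same per-frame loop (again a rough sketch; `multiGpuWorker`, `runtime` and `engineData` are placeholder names):

```cpp
#include <NvInfer.h>
#include <cuda_runtime.h>

#include <vector>

// Multi-GPU variant: one thread per device, each with its own engine,
// context and stream on that device.
void multiGpuWorker(int deviceId, nvinfer1::IRuntime* runtime,
                    const std::vector<char>& engineData,
                    size_t inputBytes, size_t outputBytes, int numFrames)
{
    cudaSetDevice(deviceId);  // all CUDA/TensorRT work in this thread now targets this GPU

    nvinfer1::ICudaEngine* engine =
        runtime->deserializeCudaEngine(engineData.data(), engineData.size());

    workerThread(engine, inputBytes, outputBytes, numFrames);  // same loop as the sketch above

    delete engine;
}
```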
So the problem is that even when I run several independent TensorRT contexts on one GPU device, they still appear to execute more or less sequentially rather than in parallel.
Environment
TensorRT Version: 8.5.2.2
GPU Type: RTX 4090 (but the issue does not seem specific to this GPU)
Nvidia Driver Version: 536.67
CUDA Version: 11.8
CUDNN Version: 8.6.0.163
Operating System + Version: Windows 10-11
Python Version (if applicable):
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag):
Relevant Files
Please attach or include links to any models, data, files, or scripts necessary to reproduce your issue. (Github repo, Google Drive, Dropbox, etc.)
Steps To Reproduce
Please include:
- Exact steps/commands to build your repro
- Exact steps/commands to run your repro
- Full traceback of errors encountered