TensorRT threads affect each other during multithreaded inference

Description

According to my understanding of the NVIDIA Deep Learning TensorRT Documentation, it should be possible to use TensorRT engines concurrently from multiple threads. However, I did not get the expected results on the platform described below.
My C++ program starts multiple threads and initializes a TensorRT context and a CUDA stream in each thread. In my testing, I found that the time it takes one model to process a frame when two model threads are running is greater than the time it takes when only one model thread is running. The thread models interfere with each other: the more threads, the slower each model runs.
DeepStream and Triton Inference Server are not suitable for my business, so I need to integrate through the TensorRT API directly. I hope you can help me solve this problem.
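
For reference, the per-thread pattern I am using looks roughly like the sketch below (a minimal, simplified version, not the code from the attached archive; the engine file name, the FP32/static-shape assumption for the bindings, and the frame count are placeholders). The engine is deserialized once and shared, while each thread owns its own execution context and CUDA stream:

```cpp
#include <NvInfer.h>
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>
#include <fstream>
#include <iterator>
#include <thread>
#include <vector>

// Minimal logger required by the TensorRT runtime.
class Logger : public nvinfer1::ILogger {
    void log(Severity severity, const char* msg) noexcept override {
        if (severity <= Severity::kWARNING) std::printf("%s\n", msg);
    }
} gLogger;

void inferLoop(nvinfer1::ICudaEngine* engine) {
    // Each thread owns its own execution context and CUDA stream;
    // the shared engine is read-only at inference time.
    nvinfer1::IExecutionContext* ctx = engine->createExecutionContext();
    cudaStream_t stream;
    cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);

    // Allocate one device buffer per binding (assumes static shapes, FP32).
    std::vector<void*> bindings(engine->getNbBindings(), nullptr);
    for (int i = 0; i < engine->getNbBindings(); ++i) {
        auto d = engine->getBindingDimensions(i);
        size_t vol = 1;
        for (int j = 0; j < d.nbDims; ++j) vol *= static_cast<size_t>(d.d[j]);
        cudaMalloc(&bindings[i], vol * sizeof(float));
    }

    for (int frame = 0; frame < 1000; ++frame) {  // placeholder frame count
        ctx->enqueueV2(bindings.data(), stream, nullptr); // async inference
        cudaStreamSynchronize(stream);                    // wait per frame
    }

    for (void* p : bindings) cudaFree(p);
    cudaStreamDestroy(stream);
    delete ctx;
}

int main(int argc, char** argv) {
    std::ifstream f("model.engine", std::ios::binary);    // placeholder path
    std::vector<char> blob((std::istreambuf_iterator<char>(f)),
                           std::istreambuf_iterator<char>());

    nvinfer1::IRuntime* runtime = nvinfer1::createInferRuntime(gLogger);
    nvinfer1::ICudaEngine* engine =
        runtime->deserializeCudaEngine(blob.data(), blob.size());

    int nThreads = argc > 1 ? std::atoi(argv[1]) : 1;
    std::vector<std::thread> threads;
    for (int i = 0; i < nThreads; ++i) threads.emplace_back(inferLoop, engine);
    for (auto& t : threads) t.join();

    delete engine;
    delete runtime;
    return 0;
}
```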

Environment

TensorRT Version: TensorRT 8.2.5
GPU Type: RTX 3080
Nvidia Driver Version: 520.56.06
CUDA Version: 11.8
CUDNN Version: 8.6.0
Operating System + Version: CentOS 7.9
Baremetal or Container (if container which image + tag): Baremetal

Relevant Files

This is the simplest test program; it contains the full C++ code and the model.
Tensort_thead_test1.tar.gz (71.9 MB)
And my ONNX model, in case you need it:
yolov5m.tar.gz (67.5 MB)

Steps To Reproduce

  1. Extract Tensort_thead_test1.tar.gz and enter the extracted directory.
  2. Open CMakeLists.txt and adjust the TensorRT and CUDA versions.
  3. Run ./build.sh
  4. Enter the build directory and run ./test <number of model threads>
    For example:
    ./test 1
    ./test 2
    View the average frame rate printed on the terminal

Hi,

The below links might be useful for you.

https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__STREAM.html
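
In particular, a non-default stream created with the non-blocking flag avoids implicit synchronization with the legacy default stream; a minimal sketch:

```cpp
#include <cuda_runtime.h>

int main() {
    // A stream created with cudaStreamNonBlocking does not implicitly
    // synchronize with the legacy default stream (stream 0), so per-thread
    // work is not falsely serialized if anything in the process uses stream 0.
    cudaStream_t stream;
    cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);
    // ... enqueue per-thread work on `stream` here ...
    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    return 0;
}
```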

For multi-threading/streaming, we suggest you use DeepStream or Triton.

For more details, we recommend raising the query in the DeepStream forum

or

in the Triton Inference Server GitHub issues section.

Thanks!

I have read many related topics, and you always reply like this, but it does not help me. I have read the relevant documentation and confirmed that I am using the APIs correctly. I hope you can debug with the example I provided.

DeepStream and Triton do not fit our business, so we build on TensorRT's API.

Hi,

Could you please try the latest TensorRT version 8.6 and let us know if you still face the same issue?

Thank you.

I tried TensorRT 8.6.1 with cuDNN 8.9.0 and still had the same problem.

Hello, you haven't replied to me for a few days. Can you give me any useful help?

I have the same problem. Is there any way to solve it? Where do you think the code makes a wrong call?

Hi,

Sorry for the delay. Please allow us some time to try reproducing this issue.

Thank you.

Excuse me, have you checked the result?

Hi @jcy.152 ,

Apologies for the delayed response.
We were unable to run the issue repro successfully.
Could you please share a minimal working repro?

Thank you.

@spolisetty The example was provided by noblehill above. I tried it, but the problem still remains. Is there any way to solve it? This kind of multithreaded parallelism does not seem performant to me.

When each of multiple threads uses TensorRT, performance drops dramatically. I have known this for two years. I cannot believe you cannot reproduce it!

It may be a thread-scheduling issue: when a thread has no activity for too long, it loses the CPU, and when activity resumes it takes much longer to get the CPU back.
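
If scheduling latency is indeed the cause, one experiment worth trying (my assumption, not a confirmed fix) is to change how waiting host threads yield the CPU, via the CUDA device flags:

```cpp
#include <cuda_runtime.h>

int main() {
    // Must run before the CUDA context is created in this process.
    // cudaDeviceScheduleSpin keeps a waiting host thread busy-polling
    // (lower wake-up latency, higher CPU usage); the default heuristic
    // or cudaDeviceScheduleBlockingSync lets it sleep and pay a wake-up
    // cost when the GPU work completes.
    cudaSetDeviceFlags(cudaDeviceScheduleSpin);
    // ... create the worker threads, contexts, and streams as usual ...
    return 0;
}
```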

Do you currently use multithreading for inference? How do you handle it?

@spolisetty How should we handle this problem?