Batch inference parallelization on TensorRT

I want to execute batch inferences concurrently on the GPU.
I read section 2.3, "Streaming", of the following documentation:
https://docs.nvidia.com/deeplearning/tensorrt/best-practices/index.html#optimize-performance
I tried to run 2 batch inferences concurrently from one thread.
For each batch, I created a CUDA stream with cudaStreamCreate and an IExecutionContext.
The problem is that the kernel executions are interleaved instead of overlapping, as if the first and second streams were waiting for each other.
Is there an obvious reason why?

NB: I do not use dynamic shapes.

Environment:
TensorRT Version: 7.2
CUDA Version: 11.2
CUDNN Version: 11.2

Hi,
The links below might be useful:
https://docs.nvidia.com/deeplearning/tensorrt/best-practices/index.html#thread-safety
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#stream-priorities
https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__STREAM.html
For multi-threading/streaming, we suggest using DeepStream or Triton.
For more details, we recommend raising the query on the DeepStream or Triton forum.

Thanks!

Thanks for your answer.
“Plugins are shared at the engine level, not the execution context level, and thus plugins which may be used simultaneously by multiple threads need to manage their resources in a thread-safe manner”
From the first link you provided, I understand that the kernels launched by IExecutionContext::enqueue may use TensorRT plugins, and that these plugins share resources, which would explain why they cannot run simultaneously.

  1. The term "thread" is ambiguous to me: does it refer to threads on GPU cores or on CPU cores?
  2. Is there a way to list the plugins TensorRT uses in enqueue, so I can check that?

Hi @juliefraysse,

  1. This is referring to CPU threads.
  2. TRT doesn’t use plugins by default - these must be inserted explicitly at the network level, either by the user or a parser. The engine building verbose logs would show whether any plugins are being used or not.
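
To illustrate what "thread-safe" means for CPU threads here, below is a minimal sketch in plain C++ with no TensorRT calls. It assumes a plugin holds some state that is shared at the engine level and touched from several CPU threads. The names SharedPluginState, recordEnqueue and runWorkers are hypothetical and are not part of the TensorRT API; a real plugin would guard whatever resources it actually owns in a similar way.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for state a plugin keeps at the engine level.
// This is NOT a TensorRT type; it only illustrates the locking pattern.
class SharedPluginState {
public:
    // May be called from any CPU thread; the mutex serializes the update.
    void recordEnqueue() {
        std::lock_guard<std::mutex> lock(mutex_);
        ++enqueueCount_;
    }

    // Reads the counter under the same lock so the value is consistent.
    std::size_t enqueueCount() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return enqueueCount_;
    }

private:
    mutable std::mutex mutex_;
    std::size_t enqueueCount_ = 0;
};

// Starts numThreads CPU threads that each touch the shared state
// callsPerThread times, waits for them, and returns the final count.
std::size_t runWorkers(SharedPluginState& state, int numThreads, int callsPerThread) {
    assert(numThreads > 0 && callsPerThread >= 0);
    std::vector<std::thread> workers;
    for (int t = 0; t < numThreads; ++t) {
        workers.emplace_back([&state, callsPerThread] {
            for (int i = 0; i < callsPerThread; ++i) {
                state.recordEnqueue();
            }
        });
    }
    for (auto& worker : workers) {
        worker.join();
    }
    return state.enqueueCount();
}
```

Without the lock, two CPU threads could update the counter at the same time and lose increments. Note that this concerns host-side correctness only; it says nothing about whether kernels from two CUDA streams overlap on the GPU.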

Thank you.

I have the same problem.
I cannot run inference in parallel at all, and GPU utilization is always below 40%.
How can I deal with this?
The details are here:

Hi @ran980,

We recommend that you follow up in the same git issue to get better help.

Thank you.