I want to execute batch inferences concurrently on the GPU.
I read section 2.3, "Streaming", of the following documentation:
I tried to run two batch inferences concurrently from a single thread.
For each batch, I created a CUDA stream using cudaStreamCreate and a separate IExecutionContext.
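A minimal sketch of the setup described above, assuming the TensorRT 7.x C++ API, an engine that is already built or deserialized, and device buffers that are already allocated and filled. The helper name `runTwoBatches` and the parameters `bindingsA` / `bindingsB` are hypothetical and only serve the illustration:

```cuda
#include <NvInfer.h>
#include <cuda_runtime_api.h>

// Hypothetical helper: enqueues two batches on two streams from one host thread.
// Assumptions: `engine` is a valid, already-deserialized engine; `bindingsA` and
// `bindingsB` are arrays of device pointers (one entry per engine binding) that
// do not share any buffer with each other.
bool runTwoBatches(nvinfer1::ICudaEngine& engine, void** bindingsA, void** bindingsB)
{
    constexpr int kNumBatches = 2;
    void** bindings[kNumBatches] = {bindingsA, bindingsB};
    cudaStream_t streams[kNumBatches] = {};
    nvinfer1::IExecutionContext* contexts[kNumBatches] = {};
    bool ok = true;

    // One stream and one execution context per batch.
    for (int i = 0; i < kNumBatches; ++i) {
        ok = ok && (cudaStreamCreate(&streams[i]) == cudaSuccess);
        contexts[i] = engine.createExecutionContext();
        ok = ok && (contexts[i] != nullptr);
    }

    // Enqueue both batches before synchronizing, so the GPU is free to overlap them.
    for (int i = 0; ok && i < kNumBatches; ++i) {
        ok = contexts[i]->enqueueV2(bindings[i], streams[i], nullptr);
    }

    // Wait for both streams to drain.
    for (int i = 0; i < kNumBatches; ++i) {
        if (streams[i] != nullptr) {
            ok = (cudaStreamSynchronize(streams[i]) == cudaSuccess) && ok;
        }
    }

    // Release resources (destroy() is the TensorRT 7.x way to free a context).
    for (int i = 0; i < kNumBatches; ++i) {
        if (contexts[i] != nullptr) {
            contexts[i]->destroy();
        }
        if (streams[i] != nullptr) {
            cudaStreamDestroy(streams[i]);
        }
    }
    return ok;
}
```

Both `enqueueV2` calls are issued back to back without any synchronization in between, so any serialization seen in the profiler does not come from the host waiting between the two launches.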
The problem is that only a few kernels are executed concurrently.
Is there an obvious reason why?
NB: I do not use dynamic shapes.
TensorRT Version: 7.2
CUDA Version: 11.2
CUDNN Version: 11.2