I want to execute batch inferences concurrently on the GPU.
I read the "2.3. Streaming" section of the following documentation: https://docs.nvidia.com/deeplearning/tensorrt/best-practices/index.html#optimize-performance
I tried to run two batch inferences concurrently from one thread.
For each batch, I created a CUDA stream with cudaStreamCreate and a separate IExecutionContext.
The problem is: the kernel executions are interleaving instead of overlapping (as if the two streams were waiting for each other).
Is there an obvious reason why?
NB: I do not use dynamic shapes.
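Simplified, the relevant part of my code looks roughly like this (a minimal sketch: engine deserialization and buffer allocation are omitted, the buffer names are placeholders, and I use enqueueV2 here):

```cpp
#include <NvInfer.h>
#include <cuda_runtime_api.h>

void runTwoBatches(nvinfer1::ICudaEngine* engine,
                   void** bindingsA, void** bindingsB)
{
    // One execution context and one stream per batch.
    nvinfer1::IExecutionContext* ctxA = engine->createExecutionContext();
    nvinfer1::IExecutionContext* ctxB = engine->createExecutionContext();

    cudaStream_t streamA, streamB;
    cudaStreamCreate(&streamA);
    cudaStreamCreate(&streamB);

    // Enqueue both batches from the same CPU thread, each on its own stream.
    ctxA->enqueueV2(bindingsA, streamA, nullptr);
    ctxB->enqueueV2(bindingsB, streamB, nullptr);

    // Wait for both streams to finish.
    cudaStreamSynchronize(streamA);
    cudaStreamSynchronize(streamB);

    cudaStreamDestroy(streamA);
    cudaStreamDestroy(streamB);
    ctxA->destroy();
    ctxB->destroy();
}
```

I expected the two enqueueV2 calls to overlap on the GPU, but profiling shows the kernels alternating between the two streams.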
Environment:
TensorRT Version: 7.2
CUDA Version: 11.2
CUDNN Version: 11.2
Thanks for your answer.
“Plugins are shared at the engine level, not the execution context level, and thus plugins which may be used simultaneously by multiple threads need to manage their resources in a thread-safe manner”
From the first link you provided, I understand that maybe the kernels launched by IExecutionContext::enqueue use TensorRT plugins, and that these plugins share resources, which could be why they cannot run simultaneously.
The term "thread" is ambiguous to me: does it refer to GPU threads or CPU threads?
Is there a way to list the plugins used by TensorRT during enqueue so I can check that?
TRT doesn’t use plugins by default; these must be inserted explicitly at the network level, either by the user or by a parser. The verbose engine-build logs would show whether any plugins are being used.
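For reference, a minimal sketch of how to capture those verbose build logs (the class name and usage below are illustrative, not something TensorRT provides out of the box):

```cpp
#include <NvInfer.h>
#include <iostream>

// Logger that prints everything up to and including verbose messages,
// so the engine-build output (including any plugin layers) is visible.
class VerboseLogger : public nvinfer1::ILogger
{
public:
    void log(Severity severity, const char* msg) noexcept override
    {
        if (severity <= Severity::kVERBOSE)
            std::cout << msg << std::endl;
    }
};

// Usage (network/parser setup omitted):
//   VerboseLogger logger;
//   nvinfer1::IBuilder* builder = nvinfer1::createInferBuilder(logger);
//   ... build the engine and inspect the printed layer information
//       for any plugin layers ...
```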