Concurrent inference in a single IExecutionContext

I’m using torch2trt to convert my Torch model to TensorRT. TRTModule creates a single execution context (IExecutionContext) for the runtime engine (https://github.com/NVIDIA-AI-IOT/torch2trt/blob/master/torch2trt/torch2trt.py#L333)

My inference code is concurrent and uses different CUDA streams for each inference execution.
Only a single inference per stream is guaranteed to be in flight at any moment.

Is it correct to use just one execution context for multi-stream concurrent inference?

Hi,

I think each thread should have its own execution context during inference, and its own stream if doing asynchronous inference.

The TensorRT best practices doc explicitly states that each thread should have its own execution context:
https://docs.nvidia.com/deeplearning/sdk/tensorrt-best-practices/index.html#thread-safety

And multiple CUDA streams can run in parallel, so for async inference each thread should have its own stream that it is queuing and synchronizing on.
If you shared a single stream, you would be pipelining all of your threads and you wouldn’t get as much of the parallel performance gain.
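For illustration, here is a minimal sketch of that pattern with the TensorRT Python API and PyTorch streams (the engine file path and the I/O tensor shapes are placeholders, not taken from your setup): each worker thread creates its own execution context and enqueues on its own stream.

```python
import threading

import tensorrt as trt
import torch

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

# Deserialize the engine once; the engine object can be shared across threads.
with open("model.engine", "rb") as f:  # hypothetical path to a serialized engine
    engine = trt.Runtime(TRT_LOGGER).deserialize_cuda_engine(f.read())

def worker():
    # Each thread owns its execution context and its CUDA stream.
    context = engine.create_execution_context()
    stream = torch.cuda.Stream()
    with torch.cuda.stream(stream):
        # Placeholder I/O buffers; shapes and dtypes depend on your model.
        inp = torch.randn(1, 3, 224, 224, device="cuda")
        out = torch.empty(1, 1000, device="cuda")
        bindings = [int(inp.data_ptr()), int(out.data_ptr())]
        # Enqueue asynchronously on this thread's own stream.
        context.execute_async_v2(bindings=bindings, stream_handle=stream.cuda_stream)
    # Wait only on this thread's stream before reading `out`.
    stream.synchronize()

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```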

Thanks


Thank you for the link!

My current implementation uses a single execution context, several Python threads, and a dedicated pool of streams ordered in a queue. Each inference is done in the first available/free stream.

I haven’t noticed any issues with thread safety so far.
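For reference, roughly what that looks like as a sketch (using torch2trt's TRTModule and PyTorch streams; the checkpoint path is a placeholder): a queue hands out the first free stream, and each call returns it when finished.

```python
import queue

import torch
from torch2trt import TRTModule

trt_model = TRTModule()
trt_model.load_state_dict(torch.load("model_trt.pth"))  # hypothetical checkpoint path

NUM_STREAMS = 4
stream_pool = queue.Queue()
for _ in range(NUM_STREAMS):
    stream_pool.put(torch.cuda.Stream())

def infer(x: torch.Tensor) -> torch.Tensor:
    stream = stream_pool.get()           # block until a free stream is available
    try:
        with torch.cuda.stream(stream):  # run this inference on the checked-out stream
            y = trt_model(x)
        stream.synchronize()             # wait only on this stream
        return y
    finally:
        stream_pool.put(stream)          # return the stream to the pool
```

Note that this still shares the one execution context across threads, which is what the best-practices doc advises against.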

And an extra question: can I use a single execution context in a single async inference thread that does inference concurrently in different streams? Only a single inference per stream is done at any moment.