My inference code is concurrent and uses a different CUDA stream for each inference execution.
Only a single inference per stream is guaranteed to be in flight at any moment.
Is it correct to use just one execution context for multi-stream concurrent inference?
Multiple CUDA streams can run in parallel, so for async inference each thread should have its own stream that it queues work on and synchronizes with.
If you shared a single stream, you would be pipelining all of your threads and you wouldn't get as much of the parallel performance gain.
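For illustration, a minimal sketch of that per-thread-stream pattern, assuming the TensorRT 8.x Python API (`execute_async_v2`) and NVIDIA's cuda-python bindings; engine deserialization, buffer allocation, and error checking are elided, and `bindings_for()` is a hypothetical helper returning each thread's device-pointer list:

```python
import threading

from cuda import cudart  # NVIDIA's cuda-python bindings

def worker(engine, bindings):
    # One execution context and one stream per thread, so no thread
    # ever enqueues into another thread's stream.
    context = engine.create_execution_context()
    err, stream = cudart.cudaStreamCreate()          # error checking omitted
    context.execute_async_v2(bindings, int(stream))  # enqueue; returns immediately
    cudart.cudaStreamSynchronize(stream)             # wait on this stream only
    cudart.cudaStreamDestroy(stream)

# `engine` is a deserialized TensorRT engine; bindings_for(i) is a
# hypothetical helper returning the device-pointer list for thread i.
threads = [threading.Thread(target=worker, args=(engine, bindings_for(i)))
           for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```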
My current implementation uses a single execution context, several Python threads, and a dedicated pool of streams held in a queue. Each inference runs on the first available (free) stream, roughly as sketched below.
I haven’t noticed any issues with thread safety so far.
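Trimmed down, the arrangement looks something like this (same cuda-python assumptions as above; `context` is the single shared execution context, and `bindings` are the device pointers for the request, both set up elsewhere):

```python
import queue

from cuda import cudart

POOL_SIZE = 4
stream_pool = queue.Queue()
for _ in range(POOL_SIZE):
    err, stream = cudart.cudaStreamCreate()  # error checking omitted
    stream_pool.put(stream)

def infer(context, bindings):
    # Called concurrently from several Python threads.
    stream = stream_pool.get()       # blocks until a stream is free
    try:
        # All threads share the one execution context; whether this
        # enqueue is safe without a lock is exactly my question.
        context.execute_async_v2(bindings, int(stream))
        cudart.cudaStreamSynchronize(stream)
    finally:
        stream_pool.put(stream)      # return the stream to the pool
```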
An extra question: can I use a single execution context in a single async inference thread that runs inference concurrently on different streams? Only a single inference per stream is in flight at any moment.
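To make that concrete, a sketch of what the single thread would do (shared context and cuda-python as above; stream and buffer setup elided):

```python
from cuda import cudart

def infer_round(context, jobs):
    # jobs: list of (bindings, stream) pairs, at most one inference
    # per stream. Enqueue everything first; execute_async_v2 returns
    # without blocking the thread.
    for bindings, stream in jobs:
        context.execute_async_v2(bindings, int(stream))
    # Only then wait for each stream's result. Whether one context may
    # have inferences in flight on several streams at once is the
    # question asked above.
    for _, stream in jobs:
        cudart.cudaStreamSynchronize(stream)
```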