This may be more of a Python question, but is there a way to run a truly multi-threaded Python application, with each thread using a different stream, so that the threads work concurrently with the GPU and I get real kernel/memcpy overlap?
Python obviously has the global interpreter lock (Glossary — Python 3.10.1 documentation), so how does this work with CUDA streams from Python?
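My understanding is that the GIL only serializes Python bytecode, and that C extension calls which release the GIL can still overlap across threads (I believe CUDA launches behind PyTorch do this, but I'm not certain). A minimal stdlib-only sketch of that assumption, with `time.sleep` standing in for a blocking GPU call:

```python
import threading
import time

def fake_gpu_call(duration):
    # time.sleep releases the GIL while blocking, the way a C/CUDA call
    # that drops the GIL would; a pure-Python busy loop would not overlap.
    time.sleep(duration)

start = time.perf_counter()
threads = [threading.Thread(target=fake_gpu_call, args=(0.2,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.perf_counter() - start
# Two 0.2 s "calls" on two threads finish in roughly 0.2 s, not 0.4 s,
# because the GIL was released during the blocking call.
```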
Sorry that we don’t have a python sample for this.
Did you run into any issues when implementing this?
If you want to run memcpy and kernel tasks concurrently, please make sure the input buffers are separate.
Otherwise there will be issues when a kernel tries to access a buffer while it is still being updated with the upcoming input values.
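To illustrate the buffer-separation point schematically (plain Python and `bytearray`s, not CUDA; the names are made up for the sketch): each pipeline owns its own input buffer, so the "copy" step of one pipeline can never overwrite data the "kernel" step of the other is still reading.

```python
import threading

# One buffer per pipeline/stream -- never shared between them.
buffers = [bytearray(4), bytearray(4)]
results = []
lock = threading.Lock()

def copy_then_compute(buf_idx, value):
    buf = buffers[buf_idx]        # this pipeline's private buffer
    buf[:] = bytes([value] * 4)   # stands in for the H2D memcpy
    total = sum(buf)              # stands in for the kernel reading it
    with lock:
        results.append(total)

threads = [threading.Thread(target=copy_then_compute, args=(i, i + 1))
           for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Each "kernel" saw a consistent buffer: sums are 4 (1*4) and 8 (2*4).
```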
@AastaLLL I’m trying to run two inference pipelines concurrently via PyTorch (each is a black-box graph with small copies, and I have no internal control of the flow there).
Running one inference pipe on a dedicated stream shows it running nicely in the profiler.
However, the utilization of the GPU seems to be very low.
Running two pipes on different streams shows very low concurrency on the GPU, even though they are running on different streams and in different Python threads. I’m not sure Python is able to run this concurrently by design, because of Python’s GIL (Glossary — Python 3.10.1 documentation).
Is that the case? Is there an example showing how to do this?
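Roughly the pattern I’m describing, sketched with hypothetical shapes and a stand-in matmul loop in place of the black-box graph (the GPU part only runs if CUDA is available):

```python
import threading

try:
    import torch
    HAVE_CUDA = torch.cuda.is_available()
except ImportError:
    HAVE_CUDA = False

results = [None, None]

def pipeline(idx, stream, x):
    # One Python thread per pipeline; each enqueues all of its work on
    # its own CUDA stream. The hope is that kernel launches release the
    # GIL, so both threads can submit work concurrently.
    with torch.cuda.stream(stream):
        y = x
        for _ in range(50):
            y = y @ y  # stand-in for the black-box inference graph
        results[idx] = y

if HAVE_CUDA:
    streams = [torch.cuda.Stream() for _ in range(2)]
    # Separate input buffers per pipeline, per the advice above.
    inputs = [torch.randn(256, 256, device="cuda") for _ in range(2)]
    threads = [threading.Thread(target=pipeline, args=(i, s, x))
               for i, (s, x) in enumerate(zip(streams, inputs))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    torch.cuda.synchronize()
```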