I am trying to write a simple CUDA scheduler: a daemon program that runs continuously and launches kernels on different streams.
A static version of this scheduler works great. It splits 64 jobs across 16 streams, giving each stream 4 jobs. This version of the scheduler runs with 99% efficiency.
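To make the static setup concrete, here is a minimal sketch of what I mean, not my actual code. The kernel `jobKernel`, the function `runStaticSchedule`, and the buffer layout are placeholder names invented for illustration. Every job is queued up front, round-robin over the streams, and the host waits only once at the end.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Placeholder job (hypothetical): each job updates its own slice of a buffer.
__global__ void jobKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

// Queues all jobs at the start, job j going to stream j % numStreams,
// then waits once for the whole device.
// Returns the number of jobs launched, or -1 on a CUDA error.
int runStaticSchedule(int numJobs, int numStreams, int n) {
    std::vector<cudaStream_t> streams(numStreams);
    for (auto &s : streams)
        if (cudaStreamCreate(&s) != cudaSuccess) return -1;

    size_t bytes = sizeof(float) * (size_t)n * (size_t)numJobs;
    float *buf = nullptr;
    if (cudaMalloc(&buf, bytes) != cudaSuccess) return -1;
    cudaMemset(buf, 0, bytes);

    int launched = 0;
    for (int job = 0; job < numJobs; ++job) {
        cudaStream_t s = streams[job % numStreams];
        jobKernel<<<(n + 255) / 256, 256, 0, s>>>(buf + (size_t)job * n, n);
        ++launched;
    }

    // Single wait at the end: the host never stalls between launches.
    cudaError_t err = cudaDeviceSynchronize();

    cudaFree(buf);
    for (auto &s : streams) cudaStreamDestroy(s);
    return err == cudaSuccess ? launched : -1;
}
```

With `runStaticSchedule(64, 16, n)`, each stream ends up with 64 / 16 = 4 jobs queued back to back.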
But rather than assigning all the work at the beginning, I need to assign jobs on the fly. I would like to wait for any one of the 16 in-flight jobs to finish and then have the host thread assign the next job to the stream that just became free.
Currently, I am trying to do this with cudaStreamSynchronize(), but the call blocks the host thread, so no further jobs can be launched while it waits, and performance decreases drastically.
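Roughly, the blocking version looks like the sketch below. It is a simplified illustration rather than my real code: `jobKernel`, `runBlockingSchedule`, and the order in which streams are visited are assumptions made for the example. Each stream gets one job, and the host refills a stream only after waiting on it.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Placeholder job (hypothetical): each job updates its own slice of a buffer.
__global__ void jobKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

// Keeps one job in flight per stream and refills a stream only after
// cudaStreamSynchronize() returns for it.
// Returns the number of jobs launched, or -1 on a CUDA error.
int runBlockingSchedule(int numJobs, int numStreams, int n) {
    std::vector<cudaStream_t> streams(numStreams);
    for (auto &s : streams)
        if (cudaStreamCreate(&s) != cudaSuccess) return -1;

    size_t bytes = sizeof(float) * (size_t)n * (size_t)numJobs;
    float *buf = nullptr;
    if (cudaMalloc(&buf, bytes) != cudaSuccess) return -1;
    cudaMemset(buf, 0, bytes);

    int grid = (n + 255) / 256;
    int next = 0;

    // Prime every stream with one job.
    for (int s = 0; s < numStreams && next < numJobs; ++s, ++next)
        jobKernel<<<grid, 256, 0, streams[s]>>>(buf + (size_t)next * n, n);

    // Refill loop: the host stalls here until stream s is idle, even if a
    // different stream finished earlier and could already take the next job.
    for (int s = 0; next < numJobs; s = (s + 1) % numStreams, ++next) {
        if (cudaStreamSynchronize(streams[s]) != cudaSuccess) return -1;
        jobKernel<<<grid, 256, 0, streams[s]>>>(buf + (size_t)next * n, n);
    }

    cudaError_t err = cudaDeviceSynchronize();

    cudaFree(buf);
    for (auto &s : streams) cudaStreamDestroy(s);
    return err == cudaSuccess ? next : -1;
}
```

The problem is the wait inside the refill loop: while the host is parked on one stream, the other streams can go idle with nothing queued behind them.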
Is there an asynchronous (non-blocking) way to synchronize on a stream that I could use instead?