Asynchronous Stream Synchronize

I am trying to write a simple CUDA scheduler: a daemon program that runs constantly and launches kernels on different streams.

A static version of this scheduler works great. It splits 64 jobs across 16 streams, giving each stream 4 jobs. This version of the scheduler runs with 99% efficiency.

But rather than assigning all the work at the beginning, I need to assign jobs on the fly. I would like to detect when any of the 16 streams finishes its current job and then have the host thread assign the next job to that stream.

Currently, I am trying to do this with cudaStreamSynchronize(), but that call blocks the host thread, so no further jobs can be launched while it waits, and performance decreases drastically.

Is there an asynchronous equivalent of cudaStreamSynchronize() that I could use instead?

Thank you,

This sounds like a place where you can use CUDA events to check progress in the different streams. See Section in the CUDA C Programming Guide.

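Edit: It doesn't look like the Programming Guide mentions cudaEventQuery(), which provides a non-blocking way to test the state of a CUDA event, in contrast to the blocking cudaEventSynchronize(). It returns cudaSuccess once all work captured by the event has completed and cudaErrorNotReady otherwise.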
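To make the event approach concrete, here is a minimal sketch. The kernel dummy_job and the helper launch_and_poll are hypothetical names invented for this illustration (the original post does not show its kernels); only the cudaEvent* calls are real runtime API functions.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical placeholder job; stands in for whatever kernel the scheduler runs.
__global__ void dummy_job(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

// Launches one job on `stream`, records `done` right after it, and polls the
// event without blocking. Returns how many polls reported "not ready",
// or -1 if the runtime reported a real error.
int launch_and_poll(cudaStream_t stream, cudaEvent_t done, float *d_data, int n) {
    dummy_job<<<(n + 255) / 256, 256, 0, stream>>>(d_data, n);
    cudaEventRecord(done, stream);  // marks the point after the kernel in this stream

    int not_ready_polls = 0;
    for (;;) {
        cudaError_t status = cudaEventQuery(done);  // returns immediately
        if (status == cudaSuccess) break;           // work before the event has finished
        if (status != cudaErrorNotReady) {          // anything else is a genuine error
            fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(status));
            return -1;
        }
        ++not_ready_polls;  // the host thread is free to do other work here
    }
    return not_ready_polls;
}
```

If the events are used only for completion checks, creating them with cudaEventCreateWithFlags(&done, cudaEventDisableTiming) avoids the timing overhead.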

There is also cudaStreamQuery(), which polls the execution status of a stream without blocking the host thread.
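Applied to the scheduler in the question, the polling loop might look roughly like the sketch below. It is not the poster's code: placeholder_job, run_dynamic_scheduler, and the per-stream scratch buffers are assumptions made up for the example, and the job itself is a trivial stand-in.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for a real job.
__global__ void placeholder_job(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Hands out `num_jobs` jobs to `num_streams` streams on the fly: whenever a
// stream reports that it is idle, it receives the next job.
// Returns the number of jobs launched, or -1 on a CUDA error.
int run_dynamic_scheduler(int num_jobs, int num_streams, int n) {
    std::vector<cudaStream_t> streams(num_streams);
    std::vector<float *> buffers(num_streams, nullptr);
    for (int s = 0; s < num_streams; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMalloc(&buffers[s], n * sizeof(float));
        cudaMemset(buffers[s], 0, n * sizeof(float));
    }

    int launched = 0;
    bool failed = false;
    while (!failed) {
        int idle = 0;  // streams that are finished and have nothing left to take
        for (int s = 0; s < num_streams; ++s) {
            cudaError_t status = cudaStreamQuery(streams[s]);  // non-blocking
            if (status == cudaErrorNotReady) continue;         // still busy
            if (status != cudaSuccess) { failed = true; break; }
            if (launched < num_jobs) {
                placeholder_job<<<(n + 255) / 256, 256, 0, streams[s]>>>(buffers[s], n);
                ++launched;
            } else {
                ++idle;
            }
        }
        if (idle == num_streams) break;  // every job launched and every stream drained
    }

    for (int s = 0; s < num_streams; ++s) {
        cudaFree(buffers[s]);
        cudaStreamDestroy(streams[s]);
    }
    return failed ? -1 : launched;
}
```

Calling run_dynamic_scheduler(64, 16, 1 << 16) would mirror the 64 jobs and 16 streams from the question. Note that this loop spins on the host; in a long-running daemon it is probably worth yielding or sleeping briefly between passes.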