Non-blocking kernel execution

Is there an approved way to determine whether kernel execution is complete? We’d like to have the CPU running its part of our algorithm in parallel with the GPU. The asynchronous launch of kernels in CUDA 1.0 is one part of this, but we need something like a non-blocking version of __syncthreads to poll for completion.

It looks like the internal entrypoint __cudaSynchronizeThreads could support this, but is there a better way?

It sounds like what you want is cudaThreadSynchronize(). The specification from the manual:


cudaError_t cudaThreadSynchronize(void);

Blocks until the device has completed all preceding requested tasks. cudaThreadSynchronize() returns an error if one of the preceding tasks failed.


This is a host function.
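
As a rough illustration of how that call fits with an asynchronous launch, here is a minimal sketch. The kernel scaleKernel and the helper cpuWork are hypothetical names made up for this example, not anything from the manual. Note that the CPU only overlaps with the GPU up to the point where cudaThreadSynchronize() is called; after that the host waits. Later CUDA releases deprecate cudaThreadSynchronize() in favor of cudaDeviceSynchronize(), so a current compiler may warn about it.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel, used only for illustration.
__global__ void scaleKernel(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

// Hypothetical stand-in for the CPU part of the algorithm.
static double cpuWork(int iterations)
{
    double acc = 0.0;
    for (int i = 0; i < iterations; ++i)
        acc += 1.0 / (i + 1.0);
    return acc;
}

int main()
{
    const int n = 1 << 20;
    float *d_data = NULL;

    if (cudaMalloc((void **)&d_data, n * sizeof(float)) != cudaSuccess)
        return 1;
    cudaMemset(d_data, 0, n * sizeof(float));

    // The launch returns control to the host right away;
    // the kernel itself runs asynchronously on the device.
    scaleKernel<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);

    // This CPU work overlaps with the kernel execution.
    double acc = cpuWork(1000000);

    // Blocks the host until all preceding device work has finished,
    // and reports an error if any of that work failed.
    cudaError_t err = cudaThreadSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        cudaFree(d_data);
        return 1;
    }

    printf("CPU result: %f, GPU work complete\n", acc);
    cudaFree(d_data);
    return 0;
}
```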

No, we want a function that doesn’t block. It should just return whether the kernel execution is complete, so the CPU can keep doing useful work.

We are working on extending the API to query the status of a kernel execution. It will be included in a future CUDA release.
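
For readers on later CUDA releases: the runtime API did gain status queries of this kind. cudaEventQuery() and cudaStreamQuery() return cudaErrorNotReady while the work is still pending and cudaSuccess once it has finished. Below is a minimal sketch of the polling loop asked for above, assuming an event is recorded right after the launch. The kernel busyKernel and the helper cpuChunk are hypothetical names used only for illustration, and the chunk size is arbitrary.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel that keeps the GPU busy for a while.
__global__ void busyKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = data[i];
        for (int k = 0; k < 10000; ++k)
            x = x * 0.999f + 1.0f;
        data[i] = x;
    }
}

// Hypothetical unit of CPU work, small enough to poll between units.
static double cpuChunk(double acc)
{
    for (int i = 1; i <= 100000; ++i)
        acc += 1.0 / i;
    return acc;
}

int main()
{
    const int n = 1 << 20;
    float *d_data = NULL;
    cudaEvent_t done;

    if (cudaMalloc((void **)&d_data, n * sizeof(float)) != cudaSuccess)
        return 1;
    cudaMemset(d_data, 0, n * sizeof(float));
    cudaEventCreate(&done);

    // Asynchronous launch, followed by an event in the same (default) stream.
    // The event completes only after the kernel has completed.
    busyKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaEventRecord(done, 0);

    // Poll without blocking: do one chunk of CPU work per iteration
    // for as long as the event reports that the GPU is not finished.
    double acc = 0.0;
    int chunks = 0;
    cudaError_t status;
    while ((status = cudaEventQuery(done)) == cudaErrorNotReady) {
        acc = cpuChunk(acc);
        ++chunks;
    }

    // Any status other than cudaSuccess here means the device work failed.
    if (status != cudaSuccess) {
        fprintf(stderr, "GPU work failed: %s\n", cudaGetErrorString(status));
        return 1;
    }

    printf("kernel finished after %d CPU chunks (acc = %f)\n", chunks, acc);
    cudaEventDestroy(done);
    cudaFree(d_data);
    return 0;
}
```

The number of chunks completed depends on the relative speed of the CPU and GPU, so it will vary from run to run and may be zero on a fast device.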


That’s good news, we have great need of this for several of our apps too…


John Stone

Is there an estimate for when that new release will be ready?