Is there an approved way to determine from the host whether kernel execution is complete? We’d like to have the CPU running its part of our algorithm in parallel with the GPU. The asynchronous launch of kernels in CUDA 1.0 is one part of this, but we need something like a non-blocking version of the host-side synchronize call (cudaThreadSynchronize, not the device-side __syncthreads barrier) to poll for completion.
It looks like the internal entrypoint __cudaSynchronizeThreads could support this, but is there a better way?
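For illustration, here is a rough sketch of the kind of polling loop being asked about. It assumes a toolkit that provides the public event API (cudaEventRecord / cudaEventQuery), which as far as I know is not part of CUDA 1.0 and only arrived in a later release. The kernel `scale`, the helper `cpu_work_slice`, and the wrapper `run_overlapped` are hypothetical names made up for this example, not anything from the toolkit.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for the GPU half of the algorithm.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Hypothetical small slice of CPU work done between polls.
static void cpu_work_slice(double *acc) {
    for (int k = 0; k < 1000; ++k) *acc += k * 1e-6;
}

// Launches the kernel, then does CPU work until the GPU reports completion.
// Returns true if every CUDA call succeeded.
bool run_overlapped(int n) {
    assert(n > 0);

    float *d_data = nullptr;
    if (cudaMalloc(&d_data, n * sizeof(float)) != cudaSuccess) return false;
    cudaMemset(d_data, 0, n * sizeof(float));

    cudaEvent_t done;
    if (cudaEventCreate(&done) != cudaSuccess) {
        cudaFree(d_data);
        return false;
    }

    // The launch returns immediately; the event is queued behind the kernel
    // in the default stream, so it completes only after the kernel does.
    scale<<<(n + 255) / 256, 256>>>(d_data, n, 2.0f);
    cudaEventRecord(done, 0);

    // Non-blocking poll: cudaEventQuery returns cudaErrorNotReady while the
    // work preceding the event is still in flight.
    double acc = 0.0;
    cudaError_t status;
    while ((status = cudaEventQuery(done)) == cudaErrorNotReady) {
        cpu_work_slice(&acc);
    }

    cudaEventDestroy(done);
    cudaFree(d_data);
    return status == cudaSuccess;
}
```

Calling `run_overlapped(1 << 20)` from main would launch the kernel and keep the CPU busy until the event reports completion. If per-stream granularity is enough, cudaStreamQuery can be polled the same way without creating an event.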