Returning a value from a kernel

I have an algorithm where a set of kernels needs to be invoked repeatedly until a termination criterion is satisfied. The question is, how can I efficiently identify when that has happened? What I really want is to be able to return a value from a kernel, but that isn’t allowed. I could write the value to global memory and then do a cudaMemcpy() after every iteration, but that adds a lot of overhead just to return a single boolean flag. Is there a more lightweight way to do it? In OpenGL programming, people generally use an occlusion query for this purpose.
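For reference, the global-memory approach described above might look roughly like the sketch below. The step() kernel, its halving update, and the threshold test are made up purely for illustration (they stand in for the real iteration and termination criterion), and error checking is left out for brevity.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Hypothetical iteration step: halves every element and raises the flag
// if any element is still above the threshold.
__global__ void step(float *data, int n, float threshold, int *notDone)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] *= 0.5f;
        if (data[i] > threshold)
            *notDone = 1;  // benign race: every writer stores the same value
    }
}

// Launches step() until no element exceeds the threshold (or maxIters is
// reached) and returns the number of iterations performed.
int iterateWithMemcpy(float *d_data, int n, float threshold, int maxIters)
{
    assert(n > 0);
    int *d_flag = nullptr;
    cudaMalloc((void **)&d_flag, sizeof(int));

    int iters = 0;
    int h_flag = 1;
    while (h_flag && iters < maxIters) {
        cudaMemset(d_flag, 0, sizeof(int));  // clear the flag before each pass
        step<<<(n + 255) / 256, 256>>>(d_data, n, threshold, d_flag);
        // Blocking copy: waits for the kernel to finish, then fetches the flag.
        cudaMemcpy(&h_flag, d_flag, sizeof(int), cudaMemcpyDeviceToHost);
        ++iters;
    }
    cudaFree(d_flag);
    return iters;
}
```

The cost in question is the extra cudaMemset() and blocking cudaMemcpy() on every pass through the loop.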


I don’t believe there is a way to return a value from a kernel directly. As far as I know, zero-copy is the most efficient way to accomplish this. Check out this thread; it looks like a very similar question to yours. tmurray does state that zero-copy is the solution…
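In case it helps, here is a minimal sketch of what the zero-copy version could look like: the flag lives in mapped, page-locked host memory, so the kernel writes it directly and no explicit copy is needed. The step() kernel and its threshold test are hypothetical placeholders for the real iteration, and error checking is omitted. On older toolkits, cudaSetDeviceFlags(cudaDeviceMapHost) has to be called before any other CUDA call for mapping to work; the sketch assumes that has been done or that the device uses unified addressing.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Hypothetical iteration step: halves every element and raises the flag
// if any element is still above the threshold.
__global__ void step(float *data, int n, float threshold, int *notDone)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] *= 0.5f;
        if (data[i] > threshold)
            *notDone = 1;  // written straight into mapped host memory
    }
}

// Same loop as the cudaMemcpy() version, but the flag is read from
// mapped pinned memory instead of being copied back explicitly.
int iterateZeroCopy(float *d_data, int n, float threshold, int maxIters)
{
    assert(n > 0);
    int *h_flag = nullptr;    // host view of the flag
    int *d_flagPtr = nullptr; // device view of the same memory
    cudaHostAlloc((void **)&h_flag, sizeof(int), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&d_flagPtr, h_flag, 0);

    int iters = 0;
    *h_flag = 1;
    while (*h_flag && iters < maxIters) {
        *h_flag = 0;  // plain host store, no cudaMemset() needed
        step<<<(n + 255) / 256, 256>>>(d_data, n, threshold, d_flagPtr);
        cudaDeviceSynchronize();  // kernel must finish before the host reads the flag
        ++iters;
    }
    cudaFreeHost(h_flag);
    return iters;
}
```

A synchronization point is still required on every iteration; what goes away is the separate memset and copy.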

It also looks like there is some interest in this feature here:

Thanks! That thread was discussing exactly the same problem I have.

No one in that thread actually posted any comparisons of the different approaches that were suggested (cudaMemcpy, cudaMemcpyAsync, zero-copy), so I’ll try them out and see how they compare.
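As a starting point for the comparison, this is a rough sketch of the cudaMemcpyAsync variant, timed with CUDA events. The step() kernel and the threshold test are again hypothetical stand-ins for the real iteration, error checking is omitted, and the event timing covers the whole loop rather than a single copy.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Hypothetical iteration step: halves every element and raises the flag
// if any element is still above the threshold.
__global__ void step(float *data, int n, float threshold, int *notDone)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] *= 0.5f;
        if (data[i] > threshold)
            *notDone = 1;
    }
}

// Runs the loop with an asynchronous flag copy into pinned host memory.
// Returns the iteration count and stores the elapsed GPU time in *elapsedMs.
int iterateWithAsyncCopy(float *d_data, int n, float threshold,
                         int maxIters, float *elapsedMs)
{
    assert(n > 0 && elapsedMs != nullptr);
    cudaStream_t stream;
    cudaEvent_t start, stop;
    int *d_flag = nullptr;
    int *h_flag = nullptr;
    cudaStreamCreate(&stream);
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaMalloc((void **)&d_flag, sizeof(int));
    cudaMallocHost((void **)&h_flag, sizeof(int));  // pinned, required for a truly async copy

    int iters = 0;
    *h_flag = 1;
    cudaEventRecord(start, stream);
    while (*h_flag && iters < maxIters) {
        cudaMemsetAsync(d_flag, 0, sizeof(int), stream);
        step<<<(n + 255) / 256, 256, 0, stream>>>(d_data, n, threshold, d_flag);
        cudaMemcpyAsync(h_flag, d_flag, sizeof(int),
                        cudaMemcpyDeviceToHost, stream);
        cudaStreamSynchronize(stream);  // flag is valid on the host only after this
        ++iters;
    }
    cudaEventRecord(stop, stream);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(elapsedMs, start, stop);

    cudaFreeHost(h_flag);
    cudaFree(d_flag);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaStreamDestroy(stream);
    return iters;
}
```

The same event-based timing can be wrapped around the blocking-copy and zero-copy loops to compare all three on equal terms.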