Best way to signal CPU to terminate kernel

Hi,

I have a CUDA kernel that the CPU launches repeatedly in a loop. A flag is set in the kernel code when the operation is complete. What is the best way to signal the CPU to terminate the loop? Copying the flag back to the CPU on every iteration wastes a lot of time, and I can't move the loop into the kernel itself because of dependencies.

Would overlapping memory transfers with kernel execution give an advantage in this case? Is there any other way?

The operation is something like this:

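__device__ int d_flag;

__global__ void kernel()
{
     if(d_flag==0)          //terminating condition
     {
          return;
     }
     ....
}

int main()
{
     int flag = 1;          // host copy of d_flag
     while(flag)
     {
           kernel<<<...>>>();
           // d_flag is a __device__ symbol, so it is read back with cudaMemcpyFromSymbol
           cudaMemcpyFromSymbol(&flag, d_flag, sizeof(int));
     }
}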

Just wanted to know what the best way to tackle this kind of operation is. Thanks.

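Zero copy is almost certainly the fastest way: keep the flag in mapped, pinned host memory so the kernel writes it directly and the host reads it without an explicit cudaMemcpy.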
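A minimal sketch of what the zero-copy approach could look like, assuming the device supports mapped host memory. The names step_kernel, run_until_done, keep_going and remaining are hypothetical, and the countdown only stands in for the real work from the original post. Error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: does one "step" and clears the flag when finished.
// keep_going points into mapped host memory, so the write lands on the host side.
__global__ void step_kernel(int *keep_going, int *remaining)
{
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        if (--(*remaining) <= 0) {
            *keep_going = 0;          // 0 = terminate, as in the original post
            __threadfence_system();   // make the write visible system-wide
        }
    }
}

// Launches the kernel until it clears the flag; returns the number of launches.
int run_until_done(int steps)
{
    // Needed before context creation on older toolkits; harmless otherwise.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    int *h_flag = nullptr;   // host view of the flag
    int *d_flag = nullptr;   // device alias of the same memory
    cudaHostAlloc((void **)&h_flag, sizeof(int), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&d_flag, h_flag, 0);
    *h_flag = 1;

    int *d_remaining = nullptr;
    cudaMalloc((void **)&d_remaining, sizeof(int));
    cudaMemcpy(d_remaining, &steps, sizeof(int), cudaMemcpyHostToDevice);

    int launches = 0;
    while (*h_flag) {
        step_kernel<<<1, 32>>>(d_flag, d_remaining);
        cudaDeviceSynchronize();   // the kernel's write is visible after this
        ++launches;
    }

    cudaFree(d_remaining);
    cudaFreeHost(h_flag);
    return launches;
}
```

The host still has to synchronize with the kernel before trusting the flag, but the separate device-to-host copy in each iteration disappears. With this countdown stand-in, run_until_done(5) performs five launches.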

Thanks for replying. However, I will also need to port this to OpenCL, and as I understand it, zero copy is not currently supported in OpenCL. Is there any other way? Thanks.
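One option that does not depend on zero copy is to check the flag only every few launches instead of after each one. Because the kernel in the original post returns immediately once d_flag is 0, the extra launches after completion are cheap no-ops. The sketch below is only an illustration under that assumption: step_kernel, run_batched, d_remaining and check_every are hypothetical names, and the countdown stands in for the real work. The same pattern should map onto a non-blocking read in OpenCL, though that port is not shown here.

```cuda
#include <cuda_runtime.h>

__device__ int d_flag;        // nonzero = keep going, 0 = done
__device__ int d_remaining;   // hypothetical stand-in for the real work

__global__ void step_kernel()
{
    if (d_flag == 0) return;                // terminating condition
    if (--d_remaining <= 0) d_flag = 0;     // single-thread launch, so no race
}

// Reads the flag back once per batch of check_every launches.
// Returns the total number of launches issued.
int run_batched(int steps, int check_every)
{
    int one = 1;
    cudaMemcpyToSymbol(d_flag, &one, sizeof(int));
    cudaMemcpyToSymbol(d_remaining, &steps, sizeof(int));

    int *h_flag = nullptr;
    cudaMallocHost((void **)&h_flag, sizeof(int));   // pinned, so the copy can be async
    *h_flag = 1;

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    int launches = 0;
    while (*h_flag) {
        for (int i = 0; i < check_every; ++i) {
            step_kernel<<<1, 1, 0, stream>>>();
            ++launches;
        }
        // One small read-back per batch instead of one per launch.
        cudaMemcpyFromSymbolAsync(h_flag, d_flag, sizeof(int), 0,
                                  cudaMemcpyDeviceToHost, stream);
        cudaStreamSynchronize(stream);
    }

    cudaStreamDestroy(stream);
    cudaFreeHost(h_flag);
    return launches;
}
```

The trade-off is that the loop can overshoot by up to check_every - 1 launches, so the batch size should stay small relative to the cost of one wasted launch.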