How to copy small data from GPU to CPU many times efficiently?

Hi, I am a newbie in CUDA. I have a small application that needs to copy a small amount of data (128 bits) from GPU to CPU many times.

It seems that each cudaMemcpy call incurs API overhead that exceeds my kernel's execution time. Is there any way to avoid this?

Should I choose zero copy or other memory techniques?

Thank you very much for any suggestion.

bool cpu_handle(T* data_cpu);

T* data;
cudaMalloc((void **)&data, sizeof(T));

T* data_cpu = (T*)malloc(sizeof(T));

for (int i = 0; i < max_it; i++) {
    kernel1<<<...>>>(..., data);

    cudaMemcpy(data_cpu, data, sizeof(T), cudaMemcpyDeviceToHost);

    if (cpu_handle(data_cpu)) {
        break;
    }
}

If your kernel executes faster than the overhead of a single cudaMemcpy call, you are probably not utilizing the GPU efficiently. Consider doing more work per kernel launch, or moving more of your algorithm onto the GPU. You haven't provided enough of an outline to make a well-formed proposal, but as a simple example, move the for-loop and the cpu_handle test onto the GPU: call the kernel once, and have it return the desired result.
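A minimal sketch of that idea, assuming cpu_handle can be ported to the device and that one iteration of kernel1's work fits in a single thread block (device_handle, kernel1_step, and iters_done are hypothetical names, not from your code):

```cuda
// Hypothetical device-side version: the iteration loop and the
// early-exit test both run on the GPU, so the host copies the
// result back only once, after the kernel finishes.
__device__ bool device_handle(const T* data);  // port of cpu_handle (assumed)
__device__ void kernel1_step(T* data);         // one iteration of kernel1 (assumed)

__global__ void iterate(T* data, int max_it, int* iters_done)
{
    for (int i = 0; i < max_it; i++) {
        kernel1_step(data);
        __syncthreads();                  // all threads see the updated data
        if (device_handle(data)) {        // former CPU-side early-exit test
            if (threadIdx.x == 0) *iters_done = i + 1;
            return;
        }
        __syncthreads();
    }
    if (threadIdx.x == 0) *iters_done = max_it;
}
```

Note this only works straightforwardly for a single-block launch, where __syncthreads() is a full barrier. If kernel1 needs multiple blocks, you would need a grid-wide sync (cooperative groups with grid.sync()), or to keep the host-side loop but drop the per-iteration copy.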
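Regarding your zero-copy question: for a 128-bit result, a mapped pinned host buffer lets the kernel write its result directly into host memory, eliminating the explicit cudaMemcpy. A sketch, assuming kernel1 writes its result through the pointer it is given (grid/block and the elided arguments are placeholders):

```cuda
// Zero-copy sketch: the device writes through a mapped pointer
// directly into pinned host memory, so no cudaMemcpy is needed.
T* data_cpu;   // host-visible result buffer
T* data_dev;   // device alias of the same memory

cudaHostAlloc((void**)&data_cpu, sizeof(T), cudaHostAllocMapped);
cudaHostGetDevicePointer((void**)&data_dev, data_cpu, 0);

for (int i = 0; i < max_it; i++) {
    kernel1<<<grid, block>>>(..., data_dev);  // kernel writes result over PCIe
    cudaDeviceSynchronize();   // ensure the write has landed before the CPU reads it
    if (cpu_handle(data_cpu))
        break;
}
cudaFreeHost(data_cpu);
```

This removes the cudaMemcpy API call, but you still pay a synchronization (cudaDeviceSynchronize or an event wait) plus a kernel-launch overhead per iteration, so it only helps so much; moving the whole loop onto the GPU avoids that per-iteration round trip entirely.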