Hi, I am a newbie in CUDA. I have a small application that needs to copy a small amount of data (128 bits) from the GPU to the CPU many times.
It seems that each cudaMemcpy incurs API overhead that exceeds my kernel's execution time. Is there any way to avoid this?
Should I use zero-copy or some other memory technique?
Thank you very much for any suggestion.
bool cpu_handle(T* data_cpu);

T* data;
cudaMalloc((void **)&data, sizeof(T));
T* data_cpu = (T*)malloc(sizeof(T));
for (int i = 0; i < max_it; i++) {
    kernel1<<<...>>>(..., data);
    cudaMemcpy(data_cpu, data, sizeof(T), cudaMemcpyDeviceToHost);
    if (cpu_handle(data_cpu)) {
        break;
    }
}
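For reference, here is a minimal sketch of the zero-copy idea the question asks about: the result buffer is allocated as mapped pinned host memory with cudaHostAlloc, so the kernel writes straight into host-visible memory and the per-iteration cudaMemcpy disappears (a synchronization before the host read is still required). The struct T, kernel body, and cpu_handle logic below are placeholders, not the real application code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

struct T { unsigned int v[4]; };  // 128-bit payload (placeholder layout)

// Placeholder kernel standing in for kernel1 from the post.
__global__ void kernel1(T* data) {
    data->v[0] = 42;  // write the small result directly to mapped host memory
}

// Placeholder host-side check.
bool cpu_handle(T* data_cpu) { return data_cpu->v[0] == 42; }

int main() {
    T* data_cpu;  // host pointer to the pinned, mapped allocation
    T* data;      // device alias of the same memory
    cudaHostAlloc((void**)&data_cpu, sizeof(T), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void**)&data, data_cpu, 0);

    const int max_it = 100;
    for (int i = 0; i < max_it; i++) {
        kernel1<<<1, 1>>>(data);
        cudaDeviceSynchronize();      // ensure the kernel's write is visible
        if (cpu_handle(data_cpu)) {   // no cudaMemcpy needed
            break;
        }
    }
    cudaFreeHost(data_cpu);
    return 0;
}
```

Note that each kernel launch still has launch overhead, so this removes only the copy's API cost; whether it wins overall depends on how often the loop iterates.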