I was reading the CUDA C Best Practices Guide:
"For example, Listing 3.1 demonstrates how host computation in the routine cpuFunction() is performed while data is transferred to the device and a kernel using the device is executed."
    cudaMemcpyAsync(a_d, a_h, size, cudaMemcpyHostToDevice, 0);
    kernel<<<grid, block>>>(a_d);
    cpuFunction();
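For reference, a self-contained version of the guide's pattern might look like the sketch below. The kernel body, the body of cpuFunction(), the buffer size, and the helper name run_overlap_demo() are assumptions made up for illustration; only the three-call sequence comes from the guide. The host buffer is allocated with cudaMallocHost() because the guide states that asynchronous transfers require pinned host memory. The sketch uses cudaDeviceSynchronize(), which assumes CUDA 4.0 or later.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: doubles each element in place.
__global__ void kernel(float *a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] *= 2.0f;
}

// Hypothetical stand-in for host work that should overlap with the GPU.
static double cpuFunction(void)
{
    double s = 0.0;
    for (int i = 0; i < 1000000; ++i) s += i * 0.5;
    return s;
}

// Returns 0 if the round trip produced the expected values.
int run_overlap_demo(void)
{
    const int n = 1 << 20;               // assumed element count
    const size_t size = n * sizeof(float);
    float *a_h = NULL, *a_d = NULL;

    // Pinned (page-locked) host memory, needed for a truly async copy.
    if (cudaMallocHost((void **)&a_h, size) != cudaSuccess) return 1;
    if (cudaMalloc((void **)&a_d, size) != cudaSuccess) {
        cudaFreeHost(a_h);
        return 1;
    }
    for (int i = 0; i < n; ++i) a_h[i] = 1.0f;

    const int block = 256;
    const int grid = (n + block - 1) / block;

    // Both calls are queued on stream 0 and return control to the host.
    cudaMemcpyAsync(a_d, a_h, size, cudaMemcpyHostToDevice, 0);
    kernel<<<grid, block>>>(a_d, n);
    double s = cpuFunction();            // host work issued right after the launch

    // Copy the result back, then wait for all queued work to finish.
    cudaMemcpyAsync(a_h, a_d, size, cudaMemcpyDeviceToHost, 0);
    cudaError_t err = cudaDeviceSynchronize();

    int ok = (err == cudaSuccess) && a_h[0] == 2.0f && a_h[n - 1] == 2.0f;
    printf("cpuFunction returned %f, GPU result %s\n", s, ok ? "ok" : "wrong");

    cudaFree(a_d);
    cudaFreeHost(a_h);
    return ok ? 0 : 1;
}

int main(void)
{
    return run_overlap_demo();
}
```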
Apparently nothing special is needed to make these calls non-blocking, but it doesn't work for me. Here is my code:
    /* Timestamp in microseconds just before the launch */
    gettimeofday(&tim, NULL);
    now = tim.tv_sec * 1000000L + tim.tv_usec;
    printf("[%llu]Start Kernel (%d-%d)\n", (unsigned long long)now, blocksPerGrid, threadsPerBlock);
    VecAdd<<<blocksPerGrid, threadsPerBlock, 0, stream>>>(A, B, C, N);
    /* Timestamp right after the launch call returns */
    gettimeofday(&tim, NULL);
    now = tim.tv_sec * 1000000L + tim.tv_usec;
    printf("[%llu]End Kernel (%d-%d)\n", (unsigned long long)now, blocksPerGrid, threadsPerBlock);
What I get at execution is:
    Start Kernel (782-256)
    End Kernel (782-256)
The launch blocks for more than 100 milliseconds, which is a significant amount of time. I have the same problem with cudaMemcpyAsync: it blocks too. I even tried using a stream, as shown in the code, but nothing works; all CUDA calls are blocking.
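One way to check whether the launch call itself blocks is to time it separately from the wait for completion, after a warm-up launch. The first CUDA call in a process can include one-time context creation, so a timing taken around the very first launch may not reflect launch cost alone. The sketch below is a diagnostic under assumptions, not a confirmed explanation of the behavior above: the VecAdd body, N = 200000 (chosen only because it yields 782 blocks of 256 threads), and the helper names now_us() and time_launch() are all made up for illustration.

```cuda
#include <cstdio>
#include <sys/time.h>
#include <cuda_runtime.h>

// Assumed kernel body; the original VecAdd is not shown in the post.
__global__ void VecAdd(const float *A, const float *B, float *C, int N)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N) C[i] = A[i] + B[i];
}

// Wall-clock time in microseconds.
static long long now_us(void)
{
    struct timeval tim;
    gettimeofday(&tim, NULL);
    return tim.tv_sec * 1000000LL + tim.tv_usec;
}

// Measures the time spent inside the launch call and the time until the
// kernel has finished. Returns 0 on success.
int time_launch(long long *launch_us, long long *total_us)
{
    const int N = 200000;                       // hypothetical size
    const size_t bytes = N * sizeof(float);
    const int threadsPerBlock = 256;
    const int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    float *A = NULL, *B = NULL, *C = NULL;
    cudaStream_t stream;

    if (cudaMalloc((void **)&A, bytes) != cudaSuccess) return 1;
    if (cudaMalloc((void **)&B, bytes) != cudaSuccess) return 1;
    if (cudaMalloc((void **)&C, bytes) != cudaSuccess) return 1;
    cudaMemset(A, 0, bytes);
    cudaMemset(B, 0, bytes);
    if (cudaStreamCreate(&stream) != cudaSuccess) return 1;

    // Warm-up launch so one-time start-up cost is excluded from the timing.
    VecAdd<<<blocksPerGrid, threadsPerBlock, 0, stream>>>(A, B, C, N);
    cudaStreamSynchronize(stream);

    long long t0 = now_us();
    VecAdd<<<blocksPerGrid, threadsPerBlock, 0, stream>>>(A, B, C, N);
    long long t1 = now_us();                    // launch call has returned
    cudaError_t err = cudaStreamSynchronize(stream);
    long long t2 = now_us();                    // kernel has finished

    *launch_us = t1 - t0;
    *total_us = t2 - t0;

    cudaStreamDestroy(stream);
    cudaFree(A);
    cudaFree(B);
    cudaFree(C);
    return (err == cudaSuccess && cudaGetLastError() == cudaSuccess) ? 0 : 1;
}

int main(void)
{
    long long launch_us = 0, total_us = 0;
    if (time_launch(&launch_us, &total_us) != 0) return 1;
    printf("launch call: %lld us, launch + completion: %lld us\n",
           launch_us, total_us);
    return 0;
}
```

If the launch time stays close to the total even after the warm-up, the call really is waiting on the kernel; if it is much smaller, the delay seen earlier was likely spent elsewhere.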
Is there a way to make these calls non-blocking? I have a Quadro FX 5800 with compute capability 1.3.
Thank you for your help.