How to make CUDA calls non-blocking?

Hello,

I was reading the CUDA C Best Practices Guide:

"For example, Listing 3.1 demonstrates how host computation in the routine cpuFunction() is performed while data is transferred to the device and a kernel using the device is executed.

cudaMemcpyAsync(a_d, a_h, size, cudaMemcpyHostToDevice, 0);
kernel<<<grid, block>>>(a_d);
cpuFunction();

"

Apparently nothing special is needed to make these calls non-blocking, but it doesn't work for me. Here is my code example:

gettimeofday(&tim, NULL);
now = tim.tv_sec * 1000000L + tim.tv_usec;
printf("[%llu]Start Kernel (%d-%d)\n", now, blocksPerGrid, threadsPerBlock);

VecAdd<<<blocksPerGrid, threadsPerBlock, 0, stream>>>(A, B, C, N);

gettimeofday(&tim, NULL);
now = tim.tv_sec * 1000000L + tim.tv_usec;
printf("[%llu]End Kernel (%d-%d)\n", now, blocksPerGrid, threadsPerBlock);

What I get at execution is:

[1326382339228376]Start Kernel (782-256)

[1326382339353622]End Kernel (782-256)

It blocks for more than 100 milliseconds, which is a significant amount of time. I have the same problem with cudaMemcpyAsync: it blocks. I even tried using a stream, as shown in the code, but nothing works; all CUDA calls are blocking.

Is there a way to make these calls non-blocking? I have a Quadro FX 5800 with compute capability 1.3.

Thank you for your help.

Are you using pinned memory as required for cudaMemcpyAsync()?
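
For reference, here is a minimal sketch of the pinned-memory pattern, assuming a float array of N elements (the names a_h, a_d, N, and stream are placeholders, not from your program):

float *a_h, *a_d;
size_t size = N * sizeof(float);

// cudaMemcpyAsync() only overlaps with host execution when the host
// buffer is page-locked; with pageable memory the copy is performed
// synchronously.
cudaMallocHost((void**)&a_h, size);
cudaMalloc((void**)&a_d, size);

cudaStream_t stream;
cudaStreamCreate(&stream);

// Returns control to the host immediately; the copy runs in the stream.
cudaMemcpyAsync(a_d, a_h, size, cudaMemcpyHostToDevice, stream);

// ... do CPU work here while the transfer is in flight ...

cudaStreamSynchronize(stream);  // wait for the copy to complete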

I suspect the 100ms you’re measuring isn’t due to kernel blocking… it’s due to context initialization. That overhead is one-time, not every call.
This would be easy to test by calling any CUDA command that initializes the context first, then timing any later call(s) to the kernels.
Try just calling the kernel 2 times, for example, and measure the time of the second kernel call.
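
Something along these lines would separate the one-time initialization cost from the steady-state launch cost (a rough sketch; VecAdd, stream, and the launch configuration are taken from your snippet):

// Warm up first: force context creation and get the first-call
// overhead out of the way before measuring anything.
cudaFree(0);  // cheap idiom to initialize the context
VecAdd<<<blocksPerGrid, threadsPerBlock, 0, stream>>>(A, B, C, N);
cudaDeviceSynchronize();  // wait for the warm-up launch to finish

struct timeval tim;
gettimeofday(&tim, NULL);
unsigned long long t0 = tim.tv_sec * 1000000ULL + tim.tv_usec;

// Second launch: if launches are asynchronous, this should return
// to the host in microseconds, long before the kernel finishes.
VecAdd<<<blocksPerGrid, threadsPerBlock, 0, stream>>>(A, B, C, N);

gettimeofday(&tim, NULL);
unsigned long long t1 = tim.tv_sec * 1000000ULL + tim.tv_usec;
printf("[launch returned after %llu us]\n", t1 - t0);

cudaDeviceSynchronize();  // now actually wait for the kernel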

I was mistaken about cudaMemcpyAsync; it's non-blocking :-)

On the other hand, the kernel launch is still blocking. The 100 ms is not due to context initialization. That overhead appears, as you said, with the first CUDA call (cudaInit, cudaMalloc, ...) and occurred with my first cudaMemcpy. The delay is still present on the second kernel call. Maybe this overhead and the blocking kernel launch are related, but I don't understand what or where the problem is :s
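
If it helps to narrow this down, here is a sketch using CUDA events to separate the host-side launch time from the kernel's device-side run time (the event code is my addition, not from your program; the kernel and launch configuration are from your snippet):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, stream);
VecAdd<<<blocksPerGrid, threadsPerBlock, 0, stream>>>(A, B, C, N);
cudaEventRecord(stop, stream);

// This printf runs as soon as the launch returns; if it appears
// before the kernel finishes, the launch itself did not block.
printf("host regained control\n");

cudaEventSynchronize(stop);  // block here until the kernel is done
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("kernel ran for %.3f ms on the device\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);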
