How to make CUDA calls non-blocking?

Hello,

I was reading the CUDA C Best Practices Guide:

"For example, Listing 3.1 demonstrates how host computation in the routine cpuFunction() is performed while data is transferred to the device and a kernel using the device is executed.

cudaMemcpyAsync(a_d, a_h, size, cudaMemcpyHostToDevice, 0);
kernel<<<grid, block>>>(a_d);
cpuFunction();

"

Apparently nothing special is needed to make these calls non-blocking, but it doesn't work for me. Here is my code example:

gettimeofday(&tim, NULL);
now = tim.tv_sec * 1000000L + tim.tv_usec;
printf("[%llu]Start Kernel (%d-%d)\n", now, blocksPerGrid, threadsPerBlock);

VecAdd<<<blocksPerGrid, threadsPerBlock, 0, stream>>>(A, B, C, N);

gettimeofday(&tim, NULL);
now = tim.tv_sec * 1000000L + tim.tv_usec;
printf("[%llu]End Kernel (%d-%d)\n", now, blocksPerGrid, threadsPerBlock);

What I get at execution is:

[1326382339228376]Start Kernel (782-256)

[1326382339353622]End Kernel (782-256)

It blocks for more than 100 milliseconds, which is a significant amount of time. I have the same problem with cudaMemcpyAsync: it blocks. I even tried using a stream, as shown in the code, but nothing works; all CUDA calls are blocking.

Is there a way to make these calls non-blocking? I have a Quadro FX 5800 with compute capability 1.3.

Thank you for your help.

Are you using pinned memory as required for cudaMemcpyAsync()?
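
For reference, here is a minimal sketch of the pinned-memory pattern, assuming a float array of N elements (the names a_h, a_d, N, and stream are placeholders, not from your program):

float *a_h, *a_d;
size_t size = N * sizeof(float);

// cudaMemcpyAsync() only overlaps with host execution when the host
// buffer is page-locked; with pageable memory the copy is performed
// synchronously.
cudaMallocHost((void**)&a_h, size);
cudaMalloc((void**)&a_d, size);

cudaStream_t stream;
cudaStreamCreate(&stream);

// Returns control to the host immediately; the copy runs in the stream.
cudaMemcpyAsync(a_d, a_h, size, cudaMemcpyHostToDevice, stream);

// ... do CPU work here while the transfer is in flight ...

cudaStreamSynchronize(stream);  // wait for the copy to complete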

I suspect the 100ms you’re measuring isn’t due to kernel blocking… it’s due to context initialization. That overhead is one-time, not every call.
This would be easy to test by calling any CUDA command that initializes the context first, then timing any later call(s) to the kernels.
Try just calling the kernel 2 times, for example, and measure the time of the second kernel call.
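
Something along these lines would separate the one-time initialization cost from the steady-state launch cost (a rough sketch; VecAdd, stream, and the launch configuration are taken from your snippet):

// Warm up first: force context creation and get the first-call
// overhead out of the way before measuring anything.
cudaFree(0);  // cheap idiom to initialize the context
VecAdd<<<blocksPerGrid, threadsPerBlock, 0, stream>>>(A, B, C, N);
cudaDeviceSynchronize();  // wait for the warm-up launch to finish

struct timeval tim;
gettimeofday(&tim, NULL);
unsigned long long t0 = tim.tv_sec * 1000000ULL + tim.tv_usec;

// Second launch: if launches are asynchronous, this should return
// to the host in microseconds, long before the kernel finishes.
VecAdd<<<blocksPerGrid, threadsPerBlock, 0, stream>>>(A, B, C, N);

gettimeofday(&tim, NULL);
unsigned long long t1 = tim.tv_sec * 1000000ULL + tim.tv_usec;
printf("[launch returned after %llu us]\n", t1 - t0);

cudaDeviceSynchronize();  // now actually wait for the kernel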

I was mistaken about cudaMemcpyAsync; it's non-blocking :-)

On the other hand, the kernel launch is still blocking. The 100 ms is not due to context initialization. That overhead appears, as you said, with the first CUDA call (cudaInit, cudaMalloc, ...) and occurred with my first cudaMemcpy. The delay is still present on the second kernel call. Maybe this overhead and the blocking kernel launch are related, but I don't understand what or where the problem is :s
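
If it helps to narrow this down, here is a sketch using CUDA events to separate the host-side launch time from the kernel's device-side run time (the event code is my addition, not from your program; the kernel and launch configuration are from your snippet):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, stream);
VecAdd<<<blocksPerGrid, threadsPerBlock, 0, stream>>>(A, B, C, N);
cudaEventRecord(stop, stream);

// This printf runs as soon as the launch returns; if it appears
// before the kernel finishes, the launch itself did not block.
printf("host regained control\n");

cudaEventSynchronize(stop);  // block here until the kernel is done
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("kernel ran for %.3f ms on the device\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);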
