Newbie: async kernel, so I can do stuff on the CPU meanwhile, yeah?

dave-lowe · January 13, 2019, 1:09pm

I’m running a kernel in a loop, & expected I’d be able to process results from the prior loop’s kernel launch in parallel, like this pseudocode:

for (i = 0; i < N; i++)
{
    kernel<<<nblocks, nthreads>>>(data[i]); // kernel takes around 2s to complete
    if (i > 0) process(data[i-1]); // disk search on cpu takes about 1s here
    cudaDeviceSynchronize();
}
process(data[N-1]);

I was expecting each iteration to take 2s, but it takes 3! The Nsight timeline shows a period of GPU inactivity corresponding to the CPU part of the loop. Furthermore, both the loop timing & the Nsight timeline look similar if I insert the cudaDeviceSynchronize() immediately after the kernel launch.

It’s as if the kernel isn’t running asynchronously… but if I printf clock() timestamps before/after the kernel launch, I can see the launch call returns immediately, so it must be asynchronous. Confused!

tera · January 13, 2019, 1:41pm

Your expectation is correct. However, there is a pitfall if you are attempting this on Windows with a GPU using the WDDM driver:

Because kernel launches are expensive in the Windows driver model, the GPU driver tries to batch multiple kernel launches into a single operation. It does that by not launching kernels immediately, but waiting for more kernel launches first.

Insert cudaStreamQuery(0) after the kernel launch and before the CPU code you want to run concurrently with the kernel. This will trigger an immediate kernel launch without further ado.
Or switch to the TCC driver, if your GPU supports it.
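
For illustration, here is a minimal sketch of the loop from the question with the cudaStreamQuery(0) call added. The kernel body, process(), run_overlapped() and the buffer sizes are hypothetical stand-ins, not the original poster's code, and the sketch assumes N >= 1. The startup check reads the tccDriver field of cudaDeviceProp to report which driver model is in use.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real (roughly 2 s) kernel.
__global__ void kernel(float *data, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) data[idx] = 2.0f * data[idx] + 1.0f;
}

// Hypothetical stand-in for the (roughly 1 s) CPU-side disk search.
static void process(int i)
{
    printf("CPU: processing result %d\n", i);
}

// Runs the loop for N >= 1 buffers of n floats each.
// Returns the number of results processed, or -1 on a CUDA error.
int run_overlapped(int N, int n)
{
    float *data = nullptr;
    size_t bytes = (size_t)N * n * sizeof(float);
    if (cudaMalloc(&data, bytes) != cudaSuccess) return -1;
    cudaMemset(data, 0, bytes);

    int processed = 0;
    for (int i = 0; i < N; i++) {
        kernel<<<(n + 255) / 256, 256>>>(data + (size_t)i * n, n);

        // Nudge the driver to submit the queued launch now instead of batching it.
        // The return value is ignored on purpose: cudaErrorNotReady only means
        // the kernel is still running, which is exactly what we want here.
        cudaStreamQuery(0);

        if (i > 0) { process(i - 1); processed++; }  // CPU work overlaps the kernel

        if (cudaDeviceSynchronize() != cudaSuccess) { cudaFree(data); return -1; }
    }
    process(N - 1);
    processed++;

    cudaFree(data);
    return processed;
}

int main()
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) == cudaSuccess)
        printf("%s: %s\n", prop.name,
               prop.tccDriver ? "TCC driver" : "not TCC (WDDM on Windows)");

    return run_overlapped(4, 1 << 20) == 4 ? 0 : 1;
}
```

On Linux or under TCC the extra cudaStreamQuery(0) should be harmless, so it can stay in portable code.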

See also here:
https://devtalk.nvidia.com/default/topic/533858/about-concurrent-execution-overlap-of-data-transfer-and-kernel-execution-/
https://stackoverflow.com/questions/13568805/cuda-kernels-not-launching-before-cudadevicesynchronize

dave-lowe · January 13, 2019, 2:09pm

Thank you very much, cudaStreamQuery(0) made it work like I expected! I’ll continue to read the links you kindly found.
