Newbie: async kernel, so I can do stuff on the CPU meanwhile, yeah?

I’m running a kernel in a loop, & expected I’d be able to process results from the prior loop’s kernel launch in parallel, like this pseudocode:

for (i = 0; i < N; i++) {
    kernel<<<nblocks, nthreads>>>(data[i]); // kernel takes around 2s to complete
    if (i > 0) process(data[i-1]);          // disk search on CPU takes about 1s
}

I was expecting each iteration to take 2s, but it takes 3! The Nsight timeline shows a period of GPU inactivity corresponding to the CPU part of the loop. Furthermore, both the loop timing & the Nsight timeline look similar if I insert a cudaDeviceSynchronize() immediately after the kernel launch.

It’s as if the kernel isn’t running asynchronously… but if I printf clock() timestamps before/after the kernel launch I can see it must be running asynchronously. Confused!

Your expectation is correct. However, there is a pitfall if you are attempting this on Windows with a GPU using the WDDM driver:

Because kernel launches are expensive in the Windows driver model, the GPU driver tries to batch multiple kernel launches into a single operation. It does that by not launching kernels immediately, but waiting for more kernel launches first.

Insert cudaStreamQuery(0) after the kernel launch and before the CPU code you want to run concurrently with the kernel. This makes the driver submit the batched launch right away instead of waiting for more work, so the kernel starts immediately.
Or switch to the TCC driver, if your GPU supports it.
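
For reference, here is a rough sketch of how the loop could look with the cudaStreamQuery(0) flush in place. It is only an illustration under assumptions, not the original poster's code: the kernel body, process(), run_pipeline(), the buffer size, and the blocking cudaMemcpy used to fetch each result are all made-up stand-ins. Error handling is kept minimal.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real (roughly 2 s) kernel: doubles each element.
__global__ void kernel(float *data, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) data[idx] *= 2.0f;
}

// Hypothetical stand-in for the real (roughly 1 s) CPU-side work.
static double process(const std::vector<float> &host)
{
    double sum = 0.0;
    for (float v : host) sum += v;
    return sum;
}

// Runs num_chunks iterations; returns the number of chunks processed on the
// CPU, or -1 if a CUDA call fails.
int run_pipeline(int num_chunks, int n)
{
    const size_t bytes = static_cast<size_t>(n) * sizeof(float);
    std::vector<float> h_in(n, 1.0f), h_prev(n), h_cur(n);
    float *d_data = nullptr;
    if (cudaMalloc(&d_data, bytes) != cudaSuccess) return -1;

    int processed = 0;
    double checksum = 0.0;
    for (int i = 0; i < num_chunks; ++i) {
        if (cudaMemcpy(d_data, h_in.data(), bytes,
                       cudaMemcpyHostToDevice) != cudaSuccess) {
            cudaFree(d_data);
            return -1;
        }
        kernel<<<(n + 255) / 256, 256>>>(d_data, n);

        // Under WDDM the launch may still be sitting in the driver's batch
        // queue. Querying the default stream pushes it to the GPU now. The
        // return value (often cudaErrorNotReady) is deliberately ignored.
        (void)cudaStreamQuery(0);

        // CPU work on the previous result overlaps with the running kernel.
        if (i > 0) { checksum += process(h_prev); ++processed; }

        // Blocking copy: waits for the kernel to finish, then fetches results.
        if (cudaMemcpy(h_cur.data(), d_data, bytes,
                       cudaMemcpyDeviceToHost) != cudaSuccess) {
            cudaFree(d_data);
            return -1;
        }
        h_prev.swap(h_cur);
    }
    // The last kernel's result has no later iteration to process it.
    if (num_chunks > 0) { checksum += process(h_prev); ++processed; }

    cudaFree(d_data);
    printf("checksum: %f\n", checksum);
    return processed;
}

int main()
{
    int processed = run_pipeline(4, 1 << 20);
    printf("processed %d chunks\n", processed);
    return processed < 0 ? 1 : 0;
}
```

On Linux or with the TCC driver the extra cudaStreamQuery(0) call should be harmless, so it can stay in portable code.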

See also here:

Thank you very much, cudaStreamQuery(0) made it work like I expected! I’ll continue to read the links you kindly found.