Device blocking while evaluating kernel: intended operation?

I’ve got a kernel that takes ~3 seconds to run to completion. It appears that when I launch the kernel, it does not actually execute until I call cudaDeviceSynchronize(), and it locks up the display driver until it completes. This required disabling TDR; otherwise Windows would assume the device had hung and restart the driver mid-run, killing the kernel.

For example:

my_kernel<<<dimGrid, dimBlock>>>(a, b);

cudaDeviceSynchronize();

Runs successfully, but freezes the entire display driver for 3 s until the kernel completes.

However, this example:

my_kernel<<<dimGrid, dimBlock>>>(a, b);

Sleep(5000);  // Win32 Sleep() takes milliseconds

cudaDeviceSynchronize();

Runs successfully, but the process sleeps for 5 s and then the display driver still freezes for 3 s. I was under the impression that kernel launches were asynchronous and began executing at the point of the <<<>>> call, which would let the display driver context-switch between running the CUDA work and updating the frame buffer. Is this not the case? Disclaimer: I’m quite new to CUDA and most likely doing something wrong.

Setup:

Card - GeForce 580m

SDK - 4.0

Compiled for - compute_20;sm_21

Driver - Latest (as of Sunday)

Platform - x64

They aren’t exactly asynchronous on WDDM (standard Win7 display devices) yet. Launching a kernel on WDDM is expensive because we have to hand it off to the Windows kernel, which then hands it to the device; this makes a kernel launch roughly 10x slower on WDDM than on other driver models. To amortize that cost a bit, we wait for the user to launch a batch of work and then push it all down at once (as soon as there’s “enough” work queued up). However, as you’ve seen, if you launch only one kernel, it won’t launch until you try to synchronize with the GPU, because we’re waiting for more work. (Someday we’ll have a better way to do this, but it hasn’t quite been finished yet…)
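To see this in action, here is a rough timing sketch (reusing the placeholder kernel and arguments from the examples above; std::chrono is just one way to take a wall-clock timestamp). The launch call returns almost immediately, but the ~3 s of GPU work doesn’t happen until the synchronize:

#include <chrono>
#include <cstdio>

auto t0 = std::chrono::steady_clock::now();
my_kernel<<<dimGrid, dimBlock>>>(a, b);   // returns right away; the work is only queued
auto launch_ms = std::chrono::duration_cast<std::chrono::milliseconds>(
    std::chrono::steady_clock::now() - t0).count();

cudaDeviceSynchronize();                  // the queued work is flushed and actually runs here
auto total_ms = std::chrono::duration_cast<std::chrono::milliseconds>(
    std::chrono::steady_clock::now() - t0).count();

printf("launch returned after %lld ms, sync finished after %lld ms\n",
       (long long)launch_ms, (long long)total_ms);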

Easy way to work around this: if you’re sure you want to launch work at this exact moment (to guarantee CPU/GPU overlap on WDDM, for example), toss in a cudaStreamQuery(0). This flushes all pending work to the GPU. You don’t need to do this on other operating systems, and in general the only time you need it right now is when you’re absolutely sure you want a kernel to launch immediately for CPU/GPU overlap. And yes, we will make this better (launch after a timeout) in a future driver release.
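As a concrete sketch (again using the placeholder kernel and arguments from the examples above), the workaround looks like this:

my_kernel<<<dimGrid, dimBlock>>>(a, b);   // queued; on WDDM it may not start executing yet

cudaStreamQuery(0);                       // flushes the pending command buffer, so the kernel starts now

// ... do CPU work here; it overlaps with the kernel running on the GPU ...

cudaDeviceSynchronize();                  // block until the kernel has finished

cudaStreamQuery() itself just asks whether the stream is done, but on WDDM the query has the useful side effect of submitting whatever work is queued.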

Perfect explanation! Thanks for the follow-up; I’ll give the workaround a shot tomorrow.