Kernel won't start until cudaDeviceSynchronize() is called

I’m having a lot of trouble getting the CPU and a GTX 980 to work in parallel. There’s only a single stream and the program is single-threaded, so it should be simple enough. But it seems like the kernel doesn’t actually start running until I call cudaDeviceSynchronize().

I tested this by putting a very long Sleep() between the kernel launch and the sync call. The program still spends exactly as much time inside cudaDeviceSynchronize() as it does with no Sleep() at all, so the kernel is clearly not running during the Sleep().
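A minimal sketch of that kind of timing test, for anyone who wants to reproduce it. This is not the original code: spinKernel, the cycle count, and the 2-second sleep are made-up placeholders, and std::this_thread::sleep_for stands in for Sleep().

```cuda
#include <chrono>
#include <cstdio>
#include <thread>
#include <cuda_runtime.h>

// Hypothetical busy kernel: spins until roughly `cycles` GPU clock ticks have passed.
__global__ void spinKernel(long long cycles) {
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

int main() {
    using Clock = std::chrono::steady_clock;

    cudaFree(0);  // create the CUDA context up front so it is not timed below

    // Asynchronous launch on the default stream; control returns to the CPU at once.
    // The cycle count is an arbitrary value, kept short enough to stay under the
    // Windows display driver timeout.
    spinKernel<<<1, 1>>>(500000000LL);
    cudaError_t launchErr = cudaGetLastError();
    if (launchErr != cudaSuccess) {
        std::printf("launch failed: %s\n", cudaGetErrorString(launchErr));
        return 1;
    }

    // Stand-in for CPU work that should overlap with the kernel.
    std::this_thread::sleep_for(std::chrono::seconds(2));

    // Time only the synchronize call.
    auto t0 = Clock::now();
    cudaError_t syncErr = cudaDeviceSynchronize();
    auto t1 = Clock::now();
    if (syncErr != cudaSuccess) {
        std::printf("sync failed: %s\n", cudaGetErrorString(syncErr));
        return 1;
    }

    double waited = std::chrono::duration<double>(t1 - t0).count();
    std::printf("time spent in cudaDeviceSynchronize(): %.3f s\n", waited);
    return 0;
}
```

If the kernel really overlapped with the sleep, the reported time should be close to zero. If it is close to the kernel's full run time, the launch was held back until the synchronize call.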

I read somewhere that parts of the Maxwell chips were implemented in software. Could this cause such problems?

(Also, I’m connected to the machine via RDP. Could that interfere somehow?)

Found a workaround. Apparently the WDDM driver tries to batch kernel launches, so if you submit only a single launch it can sit in the queue instead of being sent to the GPU right away.

Adding a cudaStreamQuery(0) call right after the kernel launch forces the driver to submit the queued launch to the GPU immediately.
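A rough sketch of where the call goes, assuming a single launch on the default stream. scaleKernel and the buffer size are hypothetical placeholders. The flush is a side effect of the query under WDDM rather than something the API promises, so treat it as a workaround and not a guarantee.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel used only to have something to launch.
__global__ void scaleKernel(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float* d_data = nullptr;
    if (cudaMalloc(&d_data, n * sizeof(float)) != cudaSuccess) return 1;
    cudaMemset(d_data, 0, n * sizeof(float));

    scaleKernel<<<(n + 255) / 256, 256>>>(d_data, n, 2.0f);

    // Non-blocking query of the default stream (stream 0). It returns
    // cudaSuccess if all work is done or cudaErrorNotReady if work is still
    // pending. On WDDM it also nudges the driver to submit the queued launch.
    cudaError_t q = cudaStreamQuery(0);
    if (q != cudaSuccess && q != cudaErrorNotReady) {
        std::printf("stream query failed: %s\n", cudaGetErrorString(q));
        cudaFree(d_data);
        return 1;
    }

    // ... CPU work that should overlap with the kernel goes here ...

    // Wait for the kernel before using its results.
    cudaError_t syncErr = cudaDeviceSynchronize();
    if (syncErr != cudaSuccess) {
        std::printf("sync failed: %s\n", cudaGetErrorString(syncErr));
    }
    cudaFree(d_data);
    return syncErr == cudaSuccess ? 0 : 1;
}
```

cudaErrorNotReady is the expected result here when the kernel is still running, so it is not treated as a failure.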