Is WDDM causing this?

Greg,
I looked at the profile information you recommended, and the number 20 definitely shows up in it. I hadn't realised that there is also a queue on the hardware, but it looks like the combined depth of the software and hardware queues is capped at 20.

This is what my profile shows; I have shifted some of the bits around to make the problem visible. As you can see, the kernels execute very quickly compared with the time it takes to re-populate the software queue, so the GPU stalls and cannot be fully utilised. This has frustrated me slightly, as I have spent a reasonable amount of effort optimising the individual kernels for various devices and have only just realised that much of that work will have had no effect on overall runtime!

With some fiddling I could reduce the number of parameters I pass to each kernel, which would cut down the number of cudaSetupArgument calls, but it looks like most of the time is simply spent in cudaLaunch. 20 kernels launched at 25us a pop is half a millisecond, which is longer than the 20 kernels take to execute. There is no memcpy or other call at the end of each block of 20 kernels that would force the stall; the gap is purely the time it takes to prepare the next block of 20. The host code is also not doing any noticeable work between kernel launches, so there is no appreciable delay there.
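To put rough numbers on this, here is a small host-only C++ sketch of the arithmetic. It is only a simplified model, not a measurement: the function names are made up for illustration, and it assumes each batch takes whichever is longer of its issue time and its execution time. Only the 20 launches and the 25us per launch come from my profile.

```cpp
#include <algorithm>
#include <cassert>

// Host time needed to issue one batch of kernel launches, in microseconds.
double batch_issue_time_us(int kernels_per_batch, double launch_cost_us) {
    assert(kernels_per_batch >= 0 && launch_cost_us >= 0.0);
    return kernels_per_batch * launch_cost_us;
}

// Fraction of wall time the GPU is busy under a simplified (assumed) model:
// each batch occupies max(issue time, execution time) of wall time, and the
// GPU is only doing useful work for the execution time.
double gpu_busy_fraction(double batch_exec_us, double batch_issue_us) {
    assert(batch_exec_us > 0.0 && batch_issue_us >= 0.0);
    const double period_us = std::max(batch_exec_us, batch_issue_us);
    return batch_exec_us / period_us;
}
```

With my numbers, batch_issue_time_us(20, 25.0) comes out at 500us. If the 20 kernels executed in, say, 200us (a hypothetical figure, not one read off the profile), gpu_busy_fraction(200.0, 500.0) would be 0.4, so under this model the GPU would sit idle for more than half the time however fast the kernels themselves are.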

Is there anything I can do to help mitigate this?