I have a project which involves an iterative algorithm consisting of lots of small kernels (<50 us each). I can’t really consolidate these, as they are interspersed with FFTs and afaik there is no device-callable cuFFT yet. The profiler appeared to show a huge delay (300 us) from a single memcpy, so after some rearranging I removed it and there is now no memcpy at all. Unfortunately that seemed to do absolutely nothing!
Investigating further, it seems my kernels are being batched up into blocks of 20 (I seem to remember this was to help hide the WDDM overhead), yet since my kernels are so short, these 20 kernels execute in less time than it takes the next batch of 20 to be prepared! This results in the GPU sitting idle for 50% of the time. Is there any effective way to lower the time it takes to prepare the next batch of kernels?
Cheers,
Tiomat
tl;dr
Kernels are being batched into blocks of 20. Running a batch takes less time than preparing the next set of 20 (450 us compared to 1000 us), which leaves big gaps between the batches. Help!
The CUDA driver has a software queue for WDDM devices to reduce the average overhead of submitting command buffers to the WDDM kernel-mode driver (KMD). Only work submitted in the same command buffer can run concurrently. cudaEventQuery(0) can be called to flush the software queue. I do not recommend calling cudaEventQuery after each launch, as this will introduce even more overhead.
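As a rough sketch of flushing every few launches rather than after each one: the helper below is hypothetical (the name `launch_with_periodic_flush` and the interval parameter are not part of the CUDA API). It is written as plain C++ with callbacks so the control flow can be checked without a GPU. In real code `launch` would be the kernel launch and `flush` would call `cudaEventQuery(0)`. Whether any particular interval actually helps would need to be confirmed in the profiler.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Hypothetical helper (not a CUDA API): issue `count` launches and call
// `flush` after every `flush_interval` launches, plus once at the end if
// some launches are still unflushed. Returns the number of flushes made.
std::size_t launch_with_periodic_flush(
    std::size_t count,
    std::size_t flush_interval,
    const std::function<void(std::size_t)>& launch,
    const std::function<void()>& flush) {
    assert(flush_interval > 0 && "flush_interval must be positive");
    std::size_t flushes = 0;
    for (std::size_t i = 0; i < count; ++i) {
        launch(i);  // in real code: my_kernel<<<grid, block>>>(...)
        if ((i + 1) % flush_interval == 0) {
            flush();  // in real code: cudaEventQuery(0)
            ++flushes;
        }
    }
    // Submit any trailing launches that did not fill a whole interval.
    if (count % flush_interval != 0) {
        flush();
        ++flushes;
    }
    return flushes;
}
```

With 20 launches and an interval of 5 this flushes 4 times instead of 20, which keeps the extra overhead per launch low while still pushing work to the GPU before the queue fills on its own.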
Thanks for that response. I am updating my version of Nsight to the latest (the 3.0 preview I was using did not seem to have that information) and will edit my post when I have re-profiled. From what I can see, though, the command buffer is 20 deep, and once it fills, the kernels launch. This is only noticeable as a problem because the kernels take less time to execute than the command buffer takes to fill, which causes the GPU to stall.
Hopefully if there is no driver way to avoid this stalling then my prayer for device callable cuFFT will solve all of my problems (plus world hunger, poverty, war, airline food …)
Greg,
So I looked at the profile information you recommended, and it does definitely show something regarding the number 20. I hadn’t realised that there was also a queue on the hardware, but it looks like the sum of software + hardware has a cap of 20.
This is what my profile shows, and I have shifted some of the bits around to show my problem. As you can see, the kernels execute very quickly compared to the time it takes to re-populate the software queue, which means the GPU stalls and cannot be fully utilised. This has frustrated me slightly, as I have spent a reasonable amount of effort optimising the individual kernels for various devices and have only just realised that a lot of that will have had no effect on overall runtime!
I could potentially try to lower the number of parameters I am passing to each kernel, with some fiddling, to reduce the number of cudaSetupArgument calls, but it looks like a lot of the time is simply the cudaLaunch calls themselves. 20 kernels launched at 25 us a pop is half a millisecond (which is longer than the 20 kernels take to execute). There is no memcpy or anything else at the end of each 20-kernel block forcing the stall, only the time it takes to prepare the next block of 20. The host code is also not doing any noticeable work between kernel launches, so there is no appreciable delay there.
Is there anything I can do to help mitigate this?
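For reference, the launch arithmetic above can be written down as a small model. This is only a back-of-the-envelope sketch in plain C++: the function names are hypothetical, and it assumes a fixed per-launch cost and a steady state in which the GPU simply waits whenever the next batch is not ready.

```cpp
#include <cassert>

// Host time (in us) to prepare one batch, assuming a fixed cost per launch.
double batch_prepare_us(int kernels_per_batch, double launch_cost_us) {
    assert(kernels_per_batch > 0 && launch_cost_us >= 0.0);
    return kernels_per_batch * launch_cost_us;
}

// Fraction of time the GPU sits idle in steady state when a batch executes
// in `exec_us` but the next batch needs `prepare_us` to become ready.
// Returns 0 when execution is the bottleneck rather than preparation.
double gpu_idle_fraction(double exec_us, double prepare_us) {
    assert(exec_us >= 0.0 && prepare_us > 0.0);
    if (exec_us >= prepare_us) {
        return 0.0;
    }
    return 1.0 - exec_us / prepare_us;
}
```

With the figures from this thread, `batch_prepare_us(20, 25.0)` gives 500 us for the cudaLaunch calls alone, and `gpu_idle_fraction(450.0, 1000.0)` gives 0.55, which is in line with the GPU sitting idle for roughly half the time.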