I have a project built around an iterative algorithm consisting of lots of small kernels (<50 µs each). I can't really consolidate them, because they are interspersed with FFTs, and afaik there is no device-callable cuFFT yet. Looking into optimisations via the profiler, it appeared there was a huge delay (~300 µs) around a single memcpy, so after some rearranging I eliminated it; there is now no memcpy at all. Unfortunately that did absolutely nothing for the overall runtime!
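For context, each iteration looks roughly like this. This is a heavily simplified sketch, not my actual code (the kernel names, sizes, and 1D C2C plan are made up), but it shows why I can't fuse the kernels across the host-side cuFFT calls:

```cpp
#include <cufft.h>
#include <cuda_runtime.h>

// Placeholder kernels standing in for the real ones (<50 us each).
__global__ void smallKernelA(cufftComplex *d, int n) { /* pointwise work */ }
__global__ void smallKernelB(cufftComplex *d, int n) { /* pointwise work */ }

void iterate(cufftComplex *d_data, int n, int iters)
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cufftHandle plan;
    cufftPlan1d(&plan, n, CUFFT_C2C, 1);
    cufftSetStream(plan, stream);   // keep the FFTs on the same stream as the kernels

    dim3 block(256), grid((n + 255) / 256);
    for (int i = 0; i < iters; ++i) {
        smallKernelA<<<grid, block, 0, stream>>>(d_data, n);
        // cufftExec* must be called from the host, so the kernels
        // before and after it can't be merged into one launch.
        cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);
        smallKernelB<<<grid, block, 0, stream>>>(d_data, n);
        cufftExecC2C(plan, d_data, d_data, CUFFT_INVERSE);
    }

    cufftDestroy(plan);
    cudaStreamDestroy(stream);
}
```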
Investigating further, it seems my kernels are being batched into groups of 20 (I seem to remember this batching exists to help hide the WDDM launch overhead). But since my kernels are so short, those 20 kernels execute in less time than it takes to prepare the next batch of 20, which leaves the GPU sitting idle for about 50% of the time. Is there any effective way to reduce the time it takes to prepare the next batch of kernels?
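One workaround I've seen suggested (but haven't verified myself) is to force the WDDM command buffer to flush early by calling cudaStreamQuery() after each launch, something along these lines:

```cpp
#include <cuda_runtime.h>

__global__ void tinyKernel() { /* <50 us of work */ }

// Launch loop with an explicit nudge after each submission.
void launchWithFlush(cudaStream_t stream, int iters)
{
    for (int i = 0; i < iters; ++i) {
        tinyKernel<<<1, 256, 0, stream>>>();
        // cudaStreamQuery() returns immediately (cudaSuccess or
        // cudaErrorNotReady), but under WDDM it reportedly makes the
        // driver submit the queued work instead of waiting for the
        // batch of ~20 launches to fill up.
        cudaStreamQuery(stream);
    }
    cudaStreamSynchronize(stream);
}
```

From what I've read, the only way to avoid WDDM batching entirely is the TCC driver, which is limited to Tesla/Quadro-class cards, so on this machine the flush trick above is about all I can try. Can anyone confirm whether cudaStreamQuery() actually forces a flush, or suggest something better?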
TL;DR: kernels are batched into groups of 20; each batch runs in less time than it takes to prepare the next one (450 µs vs 1000 µs), leaving big gaps between batches. Help!