Does anyone know why this might be? If I comment out some of my shading code (to exclude reflective and transparent materials) and increase the block size to 16x16, the CPU time roughly equals the GPU time. Why? I am still doing the same number of memcopies. I would have expected the CPU time to roughly equal the GPU time in the above profiler output as well.
I’ve tried various scheduling methods (polling and yielding), various thread priorities, etc., and my only conclusion is that the driver or hardware scheduler must be lazy, in the sense that it won’t free an MP the nanosecond it completes a kernel. I’m guessing the scheduler has some kind of update frequency or event for freeing MPs, and it simply happens to be quite slow. So with a high frequency of kernel invocations, especially lots of small/fast kernels, you probably can’t keep the entire card occupied, because the scheduler can’t free MPs that have finished executing quickly enough.
(Purely a random shot in the dark, but nothing else makes sense, at least to me.)
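One way to test the guess about small/fast kernels is to push the same amount of work through the card once as a single launch and once as many tiny launches, timing both with CUDA events and with a host clock. The sketch below is only an illustration of that experiment: `tinyKernel`, `Timing`, `timeLaunches` and all the sizes are made-up names and numbers, not anything from the renderer discussed here, and error checking is left out for brevity. Note that the event time covers the whole span between the first and last launch, so any idle gaps between kernels are included in it.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for a small, fast kernel (not the poster's shading code).
__global__ void tinyKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

struct Timing {
    float gpuMs;   // span between the two events, as seen by the device
    double cpuMs;  // host wall-clock time for the same span, including the final sync
};

// Launches `launches` kernels of `n` elements each, back to back,
// and times the whole batch with CUDA events and with std::chrono.
Timing timeLaunches(int launches, int n, int threadsPerBlock) {
    float* d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;

    // Warm-up launch so one-time initialization is not counted.
    tinyKernel<<<blocks, threadsPerBlock>>>(d, n);
    cudaDeviceSynchronize();

    auto t0 = std::chrono::steady_clock::now();
    cudaEventRecord(start);
    for (int k = 0; k < launches; ++k)
        tinyKernel<<<blocks, threadsPerBlock>>>(d, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // block the host until the last kernel is done
    auto t1 = std::chrono::steady_clock::now();

    Timing t;
    cudaEventElapsedTime(&t.gpuMs, start, stop);
    t.cpuMs = std::chrono::duration<double, std::milli>(t1 - t0).count();

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return t;
}

int main() {
    const int total = 1 << 20;  // total elements of work (arbitrary choice)
    const int pieces = 1024;    // number of small launches (arbitrary choice)

    Timing big = timeLaunches(1, total, 256);
    Timing small = timeLaunches(pieces, total / pieces, 256);

    std::printf("1 launch:     GPU %.3f ms, CPU %.3f ms\n", big.gpuMs, big.cpuMs);
    std::printf("%d launches: GPU %.3f ms, CPU %.3f ms\n", pieces, small.gpuMs, small.cpuMs);
    return 0;
}
```

If the many-launch run comes out much slower than the single launch for the same total work, that would at least be consistent with a fixed per-launch cost dominating, though it would not tell you whether the cost sits in the driver or in the hardware scheduler.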