On my system I measure an overhead of about 40 µs when launching a kernel. Is this normal, and can I do anything to bring it down?
My kernel takes about 60 µs to execute, so the overhead is quite large!
I do not transfer any data between kernel launches, but I enqueue millions of kernels as fast as I can from the CPU. I have tried running multiple steps of the algorithm in a single kernel launch, which obviously helps a lot, but this fails for problems that need more than one work group, since I need “global” synchronization before continuing to the next step.
(The same code takes about 100 µs to execute on the CPU, so the GPU is roughly 1.7x faster in the actual computation, but once the 40 µs launch overhead is added to the 60 µs of execution, the total comes to a tie.)
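To make the arithmetic concrete, here is a rough sketch in Python using only the figures quoted above. It assumes the launch overhead is paid once per launch and that running several steps in one launch costs the same per step as running them separately; both are simplifications, and the function names are just for illustration.

```python
# Figures quoted in the post, in microseconds (all approximate).
LAUNCH_OVERHEAD_US = 40.0   # measured cost of starting one kernel
GPU_KERNEL_US = 60.0        # GPU execution time for one step
CPU_STEP_US = 100.0         # CPU execution time for one step


def gpu_time_per_step(steps_per_launch: int) -> float:
    """Average GPU cost per step when one launch covers several steps.

    Assumption: the overhead is paid once per launch and the per-step
    execution time does not change when steps are fused.
    """
    return LAUNCH_OVERHEAD_US / steps_per_launch + GPU_KERNEL_US


def speedup_over_cpu(steps_per_launch: int) -> float:
    """CPU time per step divided by average GPU time per step."""
    return CPU_STEP_US / gpu_time_per_step(steps_per_launch)


for n in (1, 2, 4, 8):
    print(n, gpu_time_per_step(n), round(speedup_over_cpu(n), 2))
```

With one step per launch the two come out equal, which is the tie described above. Fusing more steps per launch moves the speedup toward 100/60, but never past it under these assumptions.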
Interesting research. It looks like we can’t profit from simple kernels, because their launch overhead cancels out their SIMD advantage. You are not using a beta driver, are you?
I wonder if your numbers will change at all if you warm the kernel up first, meaning do a couple of runs off the clock before you start timing. The first run of a kernel seems to run horribly for me.
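Something like the following is what I mean by warming up, sketched in Python with a plain callable standing in for the real launch. `time_launches` and `fake_launch` are made-up names for illustration, not part of any GPU API; in a real measurement the callable would have to enqueue the kernel and block until it has finished (for example with `clFinish` in OpenCL), otherwise you only time the enqueue call.

```python
import time


def time_launches(launch, warmup=5, runs=1000):
    """Return the mean seconds per call of `launch`, excluding warm-up calls."""
    for _ in range(warmup):
        launch()  # off the clock: lets one-time setup costs happen first
    start = time.perf_counter()
    for _ in range(runs):
        launch()
    elapsed = time.perf_counter() - start
    return elapsed / runs


def fake_launch():
    # Hypothetical stand-in for "enqueue the kernel and wait for it to finish".
    sum(range(100))


mean_s = time_launches(fake_launch)
print(f"{mean_s * 1e6:.2f} us per launch")
```

Comparing the result with `warmup=0` against a run with a few warm-up calls should show whether the first launches are skewing the 40 µs figure.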