Kernel enqueue overhead: bringing kernel overhead down?


I measure an overhead of about 40 us on my system when starting a kernel. Is this normal, and can I do anything to bring it down?
My kernel takes about 60 us to execute, so the overhead is quite large!

I do not do any transfers between kernels, but I enqueue millions of kernels as fast as I can from the CPU. I have tried doing multiple steps of the algorithm in a single kernel on the GPU, and this obviously helps a lot, but it fails for problems needing more than one work group, since I need “global” synchronization before continuing to the next step.

(The same code takes about 100 us to execute on the CPU, so the GPU is about 2x faster in the actual computation, but because of the kernel launch overhead it comes to a tie in total.)
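To make the tie concrete, here is a minimal sketch of the cost per step using the figures quoted above (40 us launch overhead, 60 us GPU kernel, 100 us CPU). The function name and the assumption that the overhead is paid once per launch, no matter how many steps that launch covers, are illustrative and not taken from the original code.

```python
# Figures quoted in the post, in microseconds.
LAUNCH_OVERHEAD_US = 40.0  # measured overhead per kernel launch
GPU_KERNEL_US = 60.0       # GPU execution time for one step
CPU_STEP_US = 100.0        # same step executed on the CPU


def gpu_time_per_step(steps_per_launch: int) -> float:
    """Average GPU cost per step when one launch covers several steps.

    Assumption: the launch overhead is paid once per launch and the
    compute time scales linearly with the number of steps.
    """
    return GPU_KERNEL_US + LAUNCH_OVERHEAD_US / steps_per_launch


for k in (1, 2, 4, 8):
    t = gpu_time_per_step(k)
    print(f"{k} step(s) per launch: {t:.1f} us/step, "
          f"speedup vs CPU {CPU_STEP_US / t:.2f}x")
```

With one step per launch the GPU and CPU both cost 100 us per step, which is the tie described above. Under this model, fusing steps only moves the speedup toward the 100/60 limit, and it is only possible when no global synchronization is needed between the fused steps.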

Best Regards,

IMHO that is the time needed to build the program from source. Check whether you can build it from a binary instead.

But my build command is executed outside the timing loop?

How did you measure the kernel launch overhead and the kernel execution time itself, then?

I measured the kernel execution time using the profiling option on the command queue. This gave:

Execution time (only kernel launch loop): 13.44s (c++ timer)

Kernel Execution Time: 7.30474s total, 7.30467e-05s pr. kernel launch

Then I turn off profiling and measure the execution time:

Execution time (only kernel launch loop): 11.43s (c++ timer)

Since 100001 kernels were executed, the overhead of launching the kernels must be

(11.43 - 7.30) s / 100001 = 41 us per kernel launch

So either the kernel launch really takes a significant amount of time, the profiling gives wrong results, or the CPU can’t feed the queue fast enough?

Launching 100001 “empty” kernels gives (with profiling):

Kernel Execution Time: 0.45063s total, 4.50626e-06s pr. kernel launch

And without profiling:

Execution time (only kernel launch loop): 4.56s (c++ timer)
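The same subtraction can be applied to both runs. This is a small sketch using only the totals posted above; the helper names are made up for illustration, and it carries over the poster's assumption that the profiled kernel time also holds for the run without profiling.

```python
N_LAUNCHES = 100001

# (wall-clock seconds without profiling, profiled kernel seconds), as posted.
RUNS = {
    "real kernel": (11.43, 7.30474),
    "empty kernel": (4.56, 0.45063),
}


def per_launch_us(total_s: float, n: int = N_LAUNCHES) -> float:
    """Convert a total in seconds to microseconds per launch."""
    return total_s / n * 1e6


def launch_overhead_us(wall_s: float, kernel_s: float, n: int = N_LAUNCHES) -> float:
    """Wall time per launch not accounted for by profiled kernel execution."""
    return per_launch_us(wall_s - kernel_s, n)


for name, (wall_s, kernel_s) in RUNS.items():
    print(f"{name}: wall {per_launch_us(wall_s):.1f} us, "
          f"kernel {per_launch_us(kernel_s):.1f} us, "
          f"overhead {launch_overhead_us(wall_s, kernel_s):.1f} us per launch")
```

Both runs come out at roughly 41 us of overhead per launch, which suggests the overhead is a fixed per-launch cost that does not depend on how much work the kernel does.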

Interesting research. It looks like we can’t profit from simple kernels, because their launch overhead cancels out their SIMD advantage. You are not using a beta driver, are you?

I use this:

(II) NVIDIA GLX Module 195.17 Mon Oct 26 08:26:05 PST 2009

I think it is a beta. Would the latest non-beta “190.53 Certified” be better? There seems to be a newer beta called 195.30, which I could try.

I just found out that if I remove the 7 image2d_t arguments from the kernel, the latency is cut in half!

The latest driver, 195.36.08, which will be available soon, doubles the performance for me.

I wonder if your numbers will change if you warm the kernel up first, meaning do a couple of runs off the clock. The first run of a kernel seems to run horribly for me.

The numbers I posted were the average kernel execution time over 100001 executions.
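A quick sketch of why a slow first launch hardly matters in an average over this many launches. The 10 ms first-launch cost is a purely hypothetical figure chosen for illustration, not a measurement from this thread; the mean is derived from the posted 11.43 s total.

```python
N_LAUNCHES = 100001
MEAN_WALL_US = 11.43 / N_LAUNCHES * 1e6  # posted total without profiling, per launch

# Hypothetical cost of a cold first launch (assumption, not measured).
HYPOTHETICAL_FIRST_US = 10_000.0


def mean_without_first(mean_us: float, first_us: float, n: int) -> float:
    """Mean over the remaining n - 1 launches after removing the first one."""
    return (mean_us * n - first_us) / (n - 1)


steady_us = mean_without_first(MEAN_WALL_US, HYPOTHETICAL_FIRST_US, N_LAUNCHES)
shift_us = MEAN_WALL_US - steady_us
print(f"mean with first launch:    {MEAN_WALL_US:.2f} us")
print(f"mean without first launch: {steady_us:.2f} us")
print(f"shift caused by warm-up:   {shift_us:.3f} us")
```

Even with a first launch that is about a hundred times slower than the rest, the average moves by roughly 0.1 us, far below the 41 us overhead being discussed.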