How big is the kernel invocation overhead?

Suppose I have to split a kernel into multiple kernels to finish a job. Should I be worried about the kernel invocation overhead? I know that it depends on how long each kernel runs, but can anyone give me an estimate, or is there a way to measure it?

The overhead is roughly 10-20 us per kernel call. I measure this by calling an empty kernel and tracking the time needed to do so.
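For anyone who wants to repeat that measurement, here is a minimal sketch of the empty-kernel timing. It is not the code used for the numbers above. The names `empty_kernel` and `average_launch_us` are made up for illustration, and it assumes a recent CUDA toolkit where `cudaDeviceSynchronize` is available. It queues the launches back to back and synchronizes once at the end, so the result is the average cost per launch rather than the round-trip latency of a single launch.

```cuda
#include <chrono>
#include <cuda_runtime.h>

// Kernel with an empty body: any time measured is overhead, not useful work.
__global__ void empty_kernel() {}

// Hypothetical helper: average wall-clock time per launch, in microseconds.
// Error checking is omitted to keep the sketch short.
double average_launch_us(int launches) {
    // Warm-up launch so one-time context setup is not counted.
    empty_kernel<<<1, 1>>>();
    cudaDeviceSynchronize();

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < launches; ++i) {
        empty_kernel<<<1, 1>>>();  // asynchronous: returns once the launch is queued
    }
    cudaDeviceSynchronize();       // wait until every queued launch has finished
    auto stop = std::chrono::steady_clock::now();

    std::chrono::duration<double, std::micro> elapsed = stop - start;
    return elapsed.count() / launches;
}
```

Calling `average_launch_us(1000)` from `main` and printing the result gives a per-launch figure for your own machine and driver. Moving the `cudaDeviceSynchronize()` inside the loop would measure launch plus completion for each call instead.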

I can confirm this; I get the same overhead.

I have the same problem: I must split a kernel into multiple kernels (if some kind of ‘global’ thread synchronization were available, this would not be necessary…). I too was worried that the overhead of calling the kernels could affect performance, since my simulator often has to call more than one hundred kernels per time step. At 10-20 us per call, that is roughly 1-2 ms of launch overhead per time step. However, that overhead is constant, regardless of the size of the problem.

Alessandro Tasora

I did some measurements of my own to find the invocation overhead of kernel launches and memory transfers. 10 us is pretty standard for an empty kernel with a very small grid size, and you can expect it to rise slightly for large grid sizes.
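A rough sketch of how such measurements could be set up is below. It is not taken from the attached draft. The helper names `launch_us_for_grid` and `small_copy_us` are hypothetical, and the 4-byte copy is just an arbitrary choice of a transfer small enough that fixed overhead dominates. The launch timing uses CUDA events, and the copy timing uses a host clock.

```cuda
#include <chrono>
#include <cuda_runtime.h>

__global__ void empty_kernel() {}

// Hypothetical helper: average time per empty launch for a given grid size,
// in microseconds, measured with CUDA events. Error checking is omitted.
float launch_us_for_grid(int blocks, int threads_per_block, int launches) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    empty_kernel<<<blocks, threads_per_block>>>();  // warm-up
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    for (int i = 0; i < launches; ++i) {
        empty_kernel<<<blocks, threads_per_block>>>();
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // block the host until 'stop' has been reached

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms * 1000.0f / launches;
}

// Hypothetical helper: average host-side time of a 4-byte host-to-device copy,
// in microseconds. The payload is tiny, so this is mostly fixed overhead.
double small_copy_us(int copies) {
    int host_value = 0;
    int* device_value = nullptr;
    cudaMalloc(&device_value, sizeof(int));
    cudaMemcpy(device_value, &host_value, sizeof(int), cudaMemcpyHostToDevice);  // warm-up

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < copies; ++i) {
        cudaMemcpy(device_value, &host_value, sizeof(int), cudaMemcpyHostToDevice);
    }
    cudaDeviceSynchronize();  // make sure the last copy has reached the device
    auto stop = std::chrono::steady_clock::now();

    cudaFree(device_value);
    std::chrono::duration<double, std::micro> elapsed = stop - start;
    return elapsed.count() / copies;
}
```

Comparing, say, `launch_us_for_grid(1, 32, 1000)` with `launch_us_for_grid(4096, 256, 1000)` shows how much the per-launch cost grows with grid size on a given card.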

I have attached a draft I wrote to record my results. You can ignore the parts where I talk about triangle and OBB tests (they are for a collision tester I’m writing).
gpu_profiling.pdf (242 KB)

It’s worth noting that the kernel launch overhead also depends on your operating system. It’s much worse under Windows Vista, for example.

In my experience it’s often better to just accept the kernel launch overhead (which really isn’t much at all) in order to have multiple highly efficient kernels, each with optimal launch parameters for its task, rather than a monolithic kernel with a huge register count. You will see a marked performance boost, and it makes optimization with the profiler much easier.

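I think you understand why it is not there, but for the others: the reason there is no global thread synchronization is that, if the kernel call specifies enough blocks, they will not all be running simultaneously. The blocks that are running would wait at the barrier for blocks that have not started yet, and those blocks can never start because the waiting blocks are holding the SMs. Any such synchronization would therefore deadlock.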
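This is also why splitting the work into several kernels does the job: kernels launched into the same stream run in order, so the boundary between two launches acts as a barrier across all blocks. Here is a small sketch of that pattern. The kernels `square_all` and `add_right_neighbor` and the wrapper `run_two_phases` are invented for illustration and are not from anyone's simulator.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Phase 1: every element is squared in place.
__global__ void square_all(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * data[i];
}

// Phase 2: each element reads a neighbour that may belong to another block,
// so it must not start before ALL of phase 1 has finished.
__global__ void add_right_neighbor(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] + in[(i + 1) % n];
}

// Hypothetical wrapper. Error checking is omitted to keep the sketch short.
std::vector<float> run_two_phases(const std::vector<float>& host_in) {
    int n = static_cast<int>(host_in.size());
    std::vector<float> host_out(n);
    if (n == 0) return host_out;

    size_t bytes = n * sizeof(float);
    float* d_a = nullptr;
    float* d_b = nullptr;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMemcpy(d_a, host_in.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;

    // Both launches go to the default stream, so the second kernel starts only
    // after every block of the first has completed. The launch boundary is the
    // "global synchronization", paid for with one extra launch overhead.
    square_all<<<blocks, threads>>>(d_a, n);
    add_right_neighbor<<<blocks, threads>>>(d_a, d_b, n);

    cudaMemcpy(host_out.data(), d_b, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_b);
    return host_out;
}
```

No explicit synchronize call is needed between the two launches. The final `cudaMemcpy` waits for both kernels before copying the result back.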

Got it: a block will not give up its place on an SM until it has finished executing.

All of you, thanks.

To Fugl: your draft was well worth reading.

What is it under Vista?