Multi-stage / multiple kernels / multiple passes: invocation performance penalty? (Kernel invocation perf)


(Experience level: newbie. Coding stage: planning.)

I have an idea which would probably require the invocation of multiple kernels, one (different/unique) kernel for each stage (which can also be thought of as a “pass”).

However, I wonder whether there is a “kernel invocation performance penalty”?

The original/first idea was to embed all code into a single kernel, however that might be bad because of “branch divergence”.

The new idea is to analyze the data in stage 1 and “collect and group” all data into “branch groups”, so that each “branch group” can be executed independently by a “branch execution kernel” at stage 2.

So each “divergent” branch would get its own kernel. On a device of compute capability 2.x, all these different kernels could be executed concurrently (if launched in different streams); otherwise they would run serially, one kernel after another. Either way, the threads inside each warp would execute in parallel without divergence, since the branch was eliminated at stage 1.
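
To make the idea concrete, here is a minimal sketch of the two stages, assuming only two branch groups. Everything in it is hypothetical and only for illustration: the branch rule (even/odd values), the kernel names `classify`, `evenKernel` and `oddKernel`, the helper `runTwoStage`, and the scratch buffers it allocates. Stage 1 appends each element's index to the list of its group; stage 2 launches one kernel per group in its own stream. To avoid copying the group sizes back to the host, each stage 2 kernel is launched with enough threads for the worst case and reads its group size on the device.

```cuda
#include <cuda_runtime.h>

// Hypothetical branch rule: even values form group 0, odd values group 1.
__device__ int branchOf(int v) { return v & 1; }

// Stage 1: each thread classifies one element and appends its index
// to the index list of the group it belongs to.
__global__ void classify(const int *in, int n, int *groupIdx, int *groupCount)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int g = branchOf(in[i]);
    int slot = atomicAdd(&groupCount[g], 1);   // next free slot in group g
    groupIdx[g * n + slot] = i;
}

// Stage 2: one kernel per group. The only branch left is the bounds check.
__global__ void evenKernel(int *data, const int *idx, const int *count)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < *count) data[idx[t]] /= 2;
}

__global__ void oddKernel(int *data, const int *idx, const int *count)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < *count) data[idx[t]] = 3 * data[idx[t]] + 1;
}

// Runs both stages on n ints that already live in device memory.
void runTwoStage(int *d_data, int n)
{
    int *d_idx = 0, *d_count = 0;
    cudaMalloc(&d_idx, 2 * n * sizeof(int));   // scratch: two index lists
    cudaMalloc(&d_count, 2 * sizeof(int));     // scratch: two group sizes

    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);
    cudaEvent_t classified;
    cudaEventCreate(&classified);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;

    // Stage 1 runs in stream 0; stream 1 waits until it has finished.
    cudaMemsetAsync(d_count, 0, 2 * sizeof(int), s[0]);
    classify<<<blocks, threads, 0, s[0]>>>(d_data, n, d_idx, d_count);
    cudaEventRecord(classified, s[0]);
    cudaStreamWaitEvent(s[1], classified, 0);

    // Stage 2: the two group kernels may overlap if the device allows it.
    evenKernel<<<blocks, threads, 0, s[0]>>>(d_data, d_idx, d_count);
    oddKernel<<<blocks, threads, 0, s[1]>>>(d_data, d_idx + n, d_count + 1);

    cudaDeviceSynchronize();
    cudaEventDestroy(classified);
    cudaStreamDestroy(s[0]);
    cudaStreamDestroy(s[1]);
    cudaFree(d_idx);
    cudaFree(d_count);
}
```

Note that this sketch trades the divergent branch for an extra pass over the data, atomic operations and scattered accesses through `idx`, so whether it wins depends on how expensive the branches are compared with that overhead.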

So let’s assume for the sake of discussion that there are 10 different kernels (stages) which all have to be executed serially, for example because of a compute capability 1.x limitation.

Also let’s assume all data has been allocated and no further memory allocations have to be done; all data is already inside the GPU’s main memory, so no CPU-to-GPU transfers are needed.

My question is the following:

Suppose the following code is executed:

Kernel 1
Kernel 2
Kernel 3
Kernel 4
Kernel 10

Is there a performance penalty for invoking a kernel?

If so, how large is that penalty? How many cycles does it take for a kernel to start executing?

In other words:

If the “branch/thread serialization” of the threads in a warp is faster than invoking an extra kernel, then the new idea might not be worth it, and everything should simply be stuffed into a single kernel, hoping for the best ;) (getting lucky with the data, i.e. little divergence).
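
One way the launch cost could be measured directly is to time a batch of empty kernel launches with CUDA events. The sketch below is only a rough probe under stated assumptions: `emptyKernel` and `averageLaunchMicroseconds` are hypothetical names, and the result is the average cost per launch when kernels are issued back to back in one stream, not the latency of a single isolated launch. It reports time rather than cycles.

```cuda
#include <cuda_runtime.h>

// A kernel that does no work, so the measured time is mostly launch cost.
__global__ void emptyKernel() {}

// Returns the average time in microseconds per launch when `launches`
// empty kernels are issued back to back in the default stream.
float averageLaunchMicroseconds(int launches)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up launch so one-time initialization is not included.
    emptyKernel<<<1, 1>>>();
    cudaDeviceSynchronize();

    cudaEventRecord(start, 0);
    for (int i = 0; i < launches; ++i)
        emptyKernel<<<1, 1>>>();
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);     // wait until all launches have finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms * 1000.0f / launches;
}
```

Calling, for example, `averageLaunchMicroseconds(10000)` from `main` and printing the result gives a number that can be compared against the time lost to divergence in the single-kernel version on the same device.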