Predicated Kernel Launch how to do that?


I have a following logic diagram to execute on gpu:

kernel A (writes to on-gpu variable P)

if (P)

 kernel B


 kernel C

or a simpler case:

kernel A (writes to on-gpu variable P)

if (P)

 kernel B


How do I achieve this behavior without a cpu readback ? I know that directx11 interface has a couple of features which allow to do that - predicated drawing (via gpu-generated queries), and indirect drawing, but I can’t see how would I utilize that in cuda.



One approach would be to use a kernelBC that incorporates the code from both current kernels B and C each encapsulated into device functions. This way, the decision which code to invoke is made inside the kernel based on a device-side variable p, a pointer to which is passed to the kernel. Here is a code sketch:

__device__ void oldKernelB(...);

__device__ void oldKernelC(...);

__global__ void kernelA(someType *pPtr, ...)   // writes p

__global__ void kernelBC(someType *pPtr, ...)  // reads p


    if (*pPtr) {


    } else {




Whether this is workable depends on how similar kernels B and C are. For example do they have identical or similar launch configurations, identical or similar function arguments.

This approach significantly increases register pressure, and obviously doesn’t work with things like cublas and cufft. Reserving a core to spin on gpu kernel completion seems like the only approach you can take… sigh.


Granted this approach does not work with pre-packaged libraries whose functions are designed to be invoked from the host. I wouldn’t necessarily expect a “significant” increase in register pressure, as the registers used for oldKernelB() can be re-used for oldKernelC(). One scenario in which such an increase in register pressure could occur is if the compiler finds many expressions common to oldKernelB() and oldKernel© and hoists them into the common path code preceeding the branch on ‘p’, using up registers to hold the values of these common subexpressions.

I’ve tried before putting 2 kernels into one, and did experience a significant register increase. If I recall correctly 2 kernels were below 30 registers, and about 45-50 when 2 of them combined.

And besides, I’ve presented a very simplified use case just to raise the problem. I’d expect at least 10 kernels appear in if/else clauses.