Pointer to __global__ kernel Question on pointers to kernels posed using the reduction SDK example


The reduction example in the CUDA SDK uses a routine reduce() to pick a kernel and also to launch the kernel. In my situation, I would like to pick a kernel once, and be able to launch the selected kernel several times.

In other words, there would be a routine, say, set_the_kernel(), which would return a pointer to the appropriate global kernel. This returned pointer would be launched several times.

Below is a sketch of what I think the code will look like; my question is whether such an attempt can be realized or is there something inherently wrong with the approach? Also, if the idea is realizable, I would appreciate any critique of the outline; and if there is a working example somewhere, please point me to it.

typedef <class T> 

__global__ void 

pointer_to_reduce_t(T *g_idata, T *g_odata, unsigned int n);

// Replace reduce() by 

template <class T> 


set_the_replace_kernel( int threads,  int blocks,  int whichKernel )





    case 6:


        if (isPow2(size))


            switch (threads)


            case 512:

                return &reduce6<T, 512, true><<< dimGrid, dimBlock, smemSize >>>(T*, T*, int); break;

            case 256:

                return &reduce6<T, 256, true><<< dimGrid, dimBlock, smemSize >>>(T*, T*, int); break;

            case 128:





The usage might look like:

pointer_to_reduce_t  *the_reduce_to_use;

the_reduce_to_use = set_the_replace_kernel( threads,  blocks,  whichKernel );

for( many times )



    // provide data to the kernel

    the_reduce_to_use( n,  d_idata,  d_odata );

    // get the results from the kernel




I think you need to use the driver API to achieve this. Or if you (is your name suggests) are coding for Fermi, you might use function pointers inside the kernel (so you call the same kernel, which then calls a device function through a function pointer.

Your use case seems to suggest it might be easier to use a templated kernel with a switch on the host side. But I assume you want to limit the number of kernels generated as you already template class T as well.

The idea can be realized, as indicated below. The essence of the scheme for implementing the idea involves making use of C++ features – and the good news is that NVIDIA’s nvcc does support the necessary features.

The key advantage of this scheme (over what is in the reduction example) is that the routine with the big-switch, get_me_a_kernel(), is called only once during initialization. During run-time, we call the faster routine run_kernel().

Code fragment

// Suppose CUDA kernel looks like

template <int a, int b>

__global__ void

some_kernel( ...args... )


    // kernel code


// A typedef of a pointer to the above would be:

typedef void (*p_some_kernel_t)( ...args... );

// In the above, note the absence of template parameters

// The pointer is a pointer to a function, and template 

// parameters are not part of the function.  Template

// parameters are more like #define constants or parameters

// known during compilation.  The compiler will create ("instantiate")

// different functions for each of the different sets of template

// parameters it comes across during compilation.

// Declare a variable to hold the pointer and 

// assign a value to it during initialization

p_some_kernel_t   my_choice_of_kernel;

my_choice_of_kernel  =  get_me_a_kernel( value_for_a, value_for_b );

// The routine get_me_a_kernel() would look like

p_some_kernel_t  get_me_a_kernel( int a, int b )


    // Big switch based on a and b

    if( ( a == IMM_A ) && ( b == IMM_B ) )


        // note: here, we have template parameters

        //       but not have routine arguments

        //       In typedef above, it is the other way

        //       around

        return &some_kernel< IMM_A,  IMM_B >;



// Use assigned pointer variable during runtime as follows:

run_kernel( my_choice_of_kernel, threads_per_block, blocks_per_grid, ...args... );

// The routine run_kernel() would be something like:

void run_kernel( p_some_kernel_t foo, int *tpb, int *bpg, ...args.. )


    dim3  tpbv = ( tpb[0], tpb[1], tbp[2] );

    dim3  bpgv = ( bpg[0], bpg[1], bpg[2] );

foo<<< tpbv,  bpgv >>>( ...args... );