Function pointers on device and per thread conditions

Hi all,

I would like to have each thread perform a different activity and I’m not sure what would be the best strategy from a performance point of view.

Let’s say I have 4 threads and there are 4 functions that should be performed, one by each. In normal C on a CPU I would create an array of function pointers and would do something like this:

// let's say we are thread number 2
int tx = 2;

void func0( float arg ) { .......... }
void func1( float arg ) { .......... }
void func2( float arg ) { .......... }
void func3( float arg ) { .......... }

// table of function pointers, indexed by thread number
void ( * funcptr[4] )( float arg ) = { &func0, &func1, &func2, &func3 };

// now call the function that the thread number defines
funcptr[tx]( 0.5f );

But according to the docs, device function pointers are not supported, so I can’t do the same as above on the GPU. First of all, do I understand the docs correctly?

Then I thought I would go about this as follows:

__device__ void func0( float arg ) { .......... }
__device__ void func1( float arg ) { .......... }
__device__ void func2( float arg ) { .......... }
__device__ void func3( float arg ) { .......... }

// inside the kernel: pick the function from the thread index
int tx = threadIdx.x;

switch( tx )
{
    case 0:
        func0( 0.5f );
        break;
    case 1:
        func1( 0.5f );
        break;
    case 2:
        func2( 0.5f );
        break;
    case 3:
        func3( 0.5f );
        break;
}
But I’m worried that this will reduce performance significantly because of the branching.

Any ideas what would be a good strategy?

If you can switch on blockIdx.x instead of threadIdx.x, then there will be no performance problem. That ensures that all threads on a multiprocessor are running the same function, while allowing each multiprocessor to do different things. (That doesn’t help if you want your 4 functions to communicate, though.)
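A minimal sketch of the blockIdx.x idea, under some assumptions: the bodies of func0..func3 are made-up placeholders (the real ones are not shown in the thread), and `HD`, `dispatch` and `per_block_kernel` are hypothetical names. The kernel is meant to be launched with 4 blocks, so every thread in a block takes the same branch of the switch.

```cpp
// Builds with nvcc as CUDA; the HD macro lets the helpers also compile as plain C++.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Placeholder bodies: stand-ins for the four functions from the question.
HD inline float func0(float x) { return  x + 1.0f; }
HD inline float func1(float x) { return  x - 1.0f; }
HD inline float func2(float x) { return -x + 1.0f; }
HD inline float func3(float x) { return -x - 1.0f; }

// Select one of the four functions by index.
HD inline float dispatch(int which, float x) {
    switch (which) {
        case 0:  return func0(x);
        case 1:  return func1(x);
        case 2:  return func2(x);
        case 3:  return func3(x);
        default: return 0.0f;  // unused index
    }
}

#ifdef __CUDACC__
// Assumed launch: per_block_kernel<<<4, n>>>(in, out, n).
// The selector is blockIdx.x, so all threads of a block run the same function.
__global__ void per_block_kernel(const float* in, float* out, int n) {
    int i = threadIdx.x;
    if (i < n) {
        out[blockIdx.x * n + i] = dispatch(blockIdx.x, in[i]);
    }
}
#endif
```

Each block writes its own slice of `out`, so combining the four results would still need a second step (another kernel or the host).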

Well, that’s the thing, the 4 functions need to communicate in the sense that after they are done the results should be added up, that’s why they are in the same thread block. But since these 4 functions are not that different (only a couple of plus and minus signs) maybe I can find a mapping from the threadIdx.x to these +/-1’s.
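One way such a mapping could look, as a sketch only: it assumes the four functions differ in exactly two signs, so the two low bits of threadIdx.x can each select +1 or -1. The expression `x + s0*a + s1*b` and the names `HD`, `sign_from_bit`, `signed_term` and `signed_sum_kernel` are invented for illustration, not taken from the actual problem.

```cpp
// Builds with nvcc as CUDA; the HD macro lets the helpers also compile as plain C++.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Map one bit of the thread index to a sign: bit clear -> +1, bit set -> -1.
HD inline float sign_from_bit(int tx, int bit) {
    return 1.0f - 2.0f * static_cast<float>((tx >> bit) & 1);
}

// Hypothetical per-thread expression: every thread executes the same
// instructions, only the two sign factors differ, so there is no branch.
HD inline float signed_term(int tx, float x, float a, float b) {
    float s0 = sign_from_bit(tx, 0);
    float s1 = sign_from_bit(tx, 1);
    return x + s0 * a + s1 * b;
}

#ifdef __CUDACC__
// Assumed launch: signed_sum_kernel<<<1, 4>>>(x, a, b, result).
__global__ void signed_sum_kernel(float x, float a, float b, float* result) {
    __shared__ float partial[4];
    int tx = threadIdx.x;
    partial[tx] = signed_term(tx, x, a, b);
    __syncthreads();  // all four partial results are now visible
    if (tx == 0) {
        *result = partial[0] + partial[1] + partial[2] + partial[3];
    }
}
#endif
```

Because the four threads stay in one block, the results can be added up through shared memory as described.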

If you really have to do something like this, I would at least attempt to get whole warps doing the same operations if at all possible. You can still have them doing different things within the same thread block, but try hard to keep threads in the same warp doing the same things…
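A rough sketch of keeping whole warps on the same path, assuming a block of 128 threads (4 warps of 32) and made-up arithmetic in place of the real functions. `HD`, `function_index`, `apply` and `per_warp_kernel` are hypothetical names; `warpSize` is the built-in CUDA device variable.

```cpp
// Builds with nvcc as CUDA; the HD macro lets the helpers also compile as plain C++.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// All threads of one warp get the same index, so the switch below
// never splits a warp.
HD inline int function_index(int tx, int warp_size) {
    return (tx / warp_size) % 4;
}

// Placeholder arithmetic standing in for the four functions.
HD inline float apply(int which, float x) {
    switch (which) {
        case 0:  return  x + 1.0f;
        case 1:  return  x - 1.0f;
        case 2:  return -x + 1.0f;
        default: return -x - 1.0f;
    }
}

#ifdef __CUDACC__
// Assumed launch: per_warp_kernel<<<1, 128>>>(in, out), with in and out
// holding at least 128 floats each.
__global__ void per_warp_kernel(const float* in, float* out) {
    int which = function_index(threadIdx.x, warpSize);
    out[threadIdx.x] = apply(which, in[threadIdx.x]);
}
#endif
```

The threads still share one block, so the per-warp results could be combined through shared memory after a `__syncthreads()`.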


Yep, I just got to the chapter of the docs that talks about precisely this :)

I think the best way to formulate complex calculation problems is as graphs, where each node is a function and the connections between nodes represent data flow. Any group of nodes with the same function type can then be combined into a CUDA warp, so that 32 identical functions are calculated in parallel, provided all of their input data is ready for processing.

Your 4 different functions cannot be grouped, so they should run in different warps, preferably in parallel on different multiprocessors.

Depending on your constraints on precomputation, you could also investigate putting part of your computation in a lookup table…