Function pointers on the device and per-thread conditions

Hi all,

I would like to have each thread perform a different activity and I’m not sure what would be the best strategy from a performance point of view.

Let’s say I have 4 threads and there are 4 functions to perform, one per thread. In normal C on a CPU I would create an array of function pointers and do something like this:

// let's say we are thread number 2
int tx = 2;

void func0( float arg ) { .......... }
void func1( float arg ) { .......... }
void func2( float arg ) { .......... }
void func3( float arg ) { .......... }

void ( *funcptr[4] )( float arg );

funcptr[0] = &func0;
funcptr[1] = &func1;
funcptr[2] = &func2;
funcptr[3] = &func3;

// now call the function that the thread number defines
funcptr[tx]( 0.5 );

But according to the docs (4.2.1.4), function pointers are not supported in device code, so I can’t do the same thing on the GPU. First of all, do I understand the docs correctly?

Then I thought I would go about it as follows:

__device__ void func0( float arg ) { .......... }
__device__ void func1( float arg ) { .......... }
__device__ void func2( float arg ) { .......... }
__device__ void func3( float arg ) { .......... }

// inside the kernel:
int tx = threadIdx.x;

switch( tx )
{
    case 0:
        func0( 0.5 );
        break;
    case 1:
        func1( 0.5 );
        break;
    case 2:
        func2( 0.5 );
        break;
    case 3:
        func3( 0.5 );
        break;
}

But I’m worried that this will reduce performance significantly, because the threads of the warp will diverge at the switch.

Any ideas what would be a good strategy?

If you can switch on blockIdx.x instead of threadIdx.x, then there will be no performance problem. That ensures that all threads on a multiprocessor run the same function, while allowing each multiprocessor to do different things. (That doesn’t help if you want your 4 functions to communicate, though.)
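A minimal sketch of that (the kernel name and launch configuration here are made up, and it reuses the __device__ functions from your post above):

__global__ void per_block_dispatch( float arg )
{
    // every thread in a block takes the same branch, so no warp ever diverges
    switch( blockIdx.x )
    {
        case 0: func0( arg ); break;
        case 1: func1( arg ); break;
        case 2: func2( arg ); break;
        case 3: func3( arg ); break;
    }
}

// launched with one block per function, e.g. per_block_dispatch<<<4, 64>>>( 0.5f );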

Well, that’s the thing: the 4 functions need to communicate in the sense that once they are done their results should be added up, which is why they are in the same thread block. But since these 4 functions are not that different (they differ only in a couple of plus and minus signs), maybe I can find a mapping from threadIdx.x to those +/-1’s.
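For example (purely an untested sketch, and the arithmetic inside common_func is just a placeholder), the per-thread +/-1 factors could sit in a small constant-memory table so that all four threads execute exactly the same instructions:

// hypothetical sign table, one row per thread of the group
__constant__ float sign[4][2] = { { +1.f, +1.f },
                                  { +1.f, -1.f },
                                  { -1.f, +1.f },
                                  { -1.f, -1.f } };

__device__ float common_func( float arg, int tx )
{
    // all threads run identical code; only the coefficients they read differ
    return sign[tx][0] * arg + sign[tx][1] * arg * arg;
}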

If you really have to do something like this, I would at least attempt to get whole warps doing the same operations if at all possible. You can still have them doing different things within the same thread block, but try hard to keep threads in the same warp doing the same things…
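Something along these lines, as an untested sketch that reuses func0..func3 from above and assumes a block of at least 4 warps (128 threads):

__global__ void per_warp_dispatch( float arg )
{
    int warp = threadIdx.x / warpSize;   // 0..3 for the first 128 threads
    switch( warp )
    {
        case 0: func0( arg ); break;
        case 1: func1( arg ); break;
        case 2: func2( arg ); break;
        case 3: func3( arg ); break;
    }
    // the partial results can then be combined via shared memory after __syncthreads()
}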

John

Yep, I just got to the chapter of the docs that talks about precisely this: 6.1.1.2 :)

I think the best way to formulate complex calculation problems is as graphs, where each node is a function and the connections between nodes represent data flow. Then any group of nodes with the same function type can be mapped onto a CUDA warp, so that 32 identical functions are calculated in parallel, provided of course that all their input data is ready for processing.
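As a rough illustration of that grouping idea (everything here is made up: the node arrays, the kernel name, and it assumes the host has already sorted the nodes by function type):

__global__ void process_nodes( const int* node_type, const float* node_arg, int n )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if( i >= n ) return;

    // node_type is sorted on the host, so whole warps normally take a single branch
    switch( node_type[i] )
    {
        case 0: func0( node_arg[i] ); break;
        case 1: func1( node_arg[i] ); break;
        case 2: func2( node_arg[i] ); break;
        case 3: func3( node_arg[i] ); break;
    }
}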

Your 4 different functions cannot be grouped like that, so they should run in different warps, preferably in parallel on different multiprocessors.

Depending on your constraints on precomputation, you could also investigate putting part of your computation in a lookup table…
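For instance (an untested sketch with made-up names, assuming the argument can be normalised to [0, 1] and that nearest-entry accuracy is acceptable), the function values could be precomputed on the host and read from constant memory on the device:

#define TABLE_SIZE 256

// one row of precomputed values per original function
__constant__ float func_table[4][TABLE_SIZE];

// host side (error checking omitted):
//   float h_table[4][TABLE_SIZE];
//   ... fill h_table[f][i] with function f evaluated at i / (float)( TABLE_SIZE - 1 ) ...
//   cudaMemcpyToSymbol( func_table, h_table, sizeof( h_table ) );

__device__ float lookup( int f, float arg )
{
    int i = (int)( arg * ( TABLE_SIZE - 1 ) );   // nearest table entry, no branching
    return func_table[f][i];
}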

Mark