How can I use __device__ function pointers in CUDA?

Dear all,

How can I use pointers to device functions in CUDA?

I don’t want to use a big “switch-case” block to handle the device function selection in my program, since the number of functions is large.

But it seems device functions have no addresses in CUDA. Is there any feasible solution?

Thanks in advance!

No Solution…

But you could hide the “switch-case” itself inside a __device__ function and call it in the appropriate place :-)

You can’t. At the moment, device functions are inline-expanded by the compiler. They don’t actually exist as code objects and have no call stack or physical address.

Thanks! You guys are correct about the inlining…

It seems I have to use “switch-case” to deal with it… :sweat:

Hi

I have a quite similar problem. I would like to “link” a device function from the host code into the kernel, but I guess that won’t work. Any ideas for a workaround? What I’m trying to achieve is a C++ framework where you can easily parameterize and run a complicated kernel (a parallel genetic algorithm, actually), but there is one function which is problem-specific (the fitness function), and I would like to be able to link it in from the same file where I set the parameters. Multiple kernels wouldn’t be nice…

Is there any possible solution for this?

Well, to put it simply, I would like to actually inject some code into the kernel via the host code. Any possibilities for that?

If you know the arguments at compile time, you could use template instantiation with functors to define your kernel, parameterized by the template arguments.

I’ve done this many times to make multiple variants of kernels without having to cut and paste code together. Stuff like a base algorithm which takes optional preprocess and postprocess functors. The postprocess may be a no-op, or maybe it does a test in the kernel so that only “high-scoring” values are returned. The preprocess may be a no-op, or maybe it filters out possible candidates, etc.

It’s not as versatile as function pointers of course, but for a lot of hacks it’s really cool.
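As a sketch of the functor-template idea (the names AddOp, MulOp, and apply_op are made up for illustration, not taken from any real project): the kernel takes the functor type as a template parameter, and each instantiation gets the operation inlined at compile time, so no function pointer is ever needed.

```cuda
// Hypothetical functors; operator() is inlined into each kernel instantiation.
struct AddOp {
    __device__ float operator()(float x, float y) const { return x + y; }
};

struct MulOp {
    __device__ float operator()(float x, float y) const { return x * y; }
};

// One kernel body, parameterized by the operation type.
template <typename Op>
__global__ void apply_op(const float *a, const float *b, float *res, int n)
{
    Op op;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        res[i] = op(a[i], b[i]);
    }
}

// At launch time you pick the instantiation you need:
//     apply_op<AddOp><<<1, N>>>(a_d, b_d, res_d, N);
//     apply_op<MulOp><<<1, N>>>(a_d, b_d, res_d, N);
```

Each instantiation is a separate kernel in the binary, which is exactly why this only works when the set of operations is known at compile time.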

I’m pretty sure for my next project, I’ll be dynamically generating and compiling CUDA code on the fly to solve this problem. PyCUDA makes it very easy (and “developers” are the end-users for me, so nvcc is a prerequisite) to do things like this, and it seems to be the best approach in the absence of device-side linking. OpenCL definitely has an advantage here, as the compiler is built into the driver, so everyone already has it.

(In one of my previous projects, I used the template trick to generate all the possible combinations of kernel implementations I could possibly call at runtime. Was ugly but effective.)

[Clarified the wording of the initial paragraph in a later edit – nj]

Function pointers for device functions are supported in CUDA 3.2 on sm_2x platforms, based on the ABI that was introduced with CUDA 3.1. Here is a very simple app showing that everything works exactly the same as with function pointers in host code.

#include <stdio.h>
#include <stdlib.h>

#define N 5

__device__ float add_func (float x, float y)
{
    return x + y;
}

__device__ float mul_func (float x, float y)
{
    return x * y;
}

__device__ float div_func (float x, float y)
{
    return x / y;
}

typedef float (*op_func) (float, float);

__device__ op_func func[3] = { add_func, mul_func, div_func };

__device__ const char* op_name[3] = { "add", "mul", "div" };

__device__ void op_array (const float *a, const float *b, float *res, int op, int n)
{
    for (int i = 0; i < n; i++) {
        res[i] = func[op](a[i], b[i]);
    }
}

__global__ void kernel (void)
{
    float x[N];
    float y[N];
    float res[N];

    for (int i = 0; i < N; i++) {
        x[i] = (float)(10 + i);
    }
    for (int i = 0; i < N; i++) {
        y[i] = (float)(100 + i);
    }
    for (int op = 0; op < 3; op++) {
        printf ("\nop=%s\n", op_name[op]);
        op_array (x, y, res, op, N);
        for (int i = 0; i < N; i++) {
            printf ("res = % 16.9e\n", res[i]);
        }
    }
}

int main (void)
{
    kernel<<<1,1>>>();
    cudaThreadSynchronize();
    return EXIT_SUCCESS;
}

The build and run log (Linux64, C2050) looks as follows:

~ $ nvcc -arch=sm_20 -o funcptr funcptr.cu
~ $ funcptr

op=add
res =  1.100000000e+02
res =  1.120000000e+02
res =  1.140000000e+02
res =  1.160000000e+02
res =  1.180000000e+02

op=mul
res =  1.000000000e+03
res =  1.111000000e+03
res =  1.224000000e+03
res =  1.339000000e+03
res =  1.456000000e+03

op=div
res =  1.000000015e-01
res =  1.089108884e-01
res =  1.176470593e-01
res =  1.262135953e-01
res =  1.346153915e-01

It has been pointed out to me that device function pointers are discussed in section B.1.4.4 of the CUDA C Programming Guide, and that there is an example application FunctionPointers in the SDK.

Well, templating is not an option for me (at least based on the little I know about templates, though those preprocess/postprocess functors done with templates sound very interesting; can you show a quick example of that? thanks), because the functions don’t have to be alike. What I’m trying to create is a system that is closed and easy to use: you just give it some arguments, but one of these “arguments” actually has to be a complete function that tells the algorithm what to do with the data provided, so it can calculate how good a given individual is. And this function is totally problem-specific; it could be anything from a simple sqrt() to some insane differential equation. So I have to link it in somehow, and preferably through the C++ host code, not just within the kernel; I have to pass the pointer to the kernel from the host code. I’ve seen the example of device function pointers here, and I have to admit I’m pretty new to device pointers and don’t know if that example applies to my case, and for testing I would have to get a brand new graphics card…

So I would be really grateful if someone can help me clarify this :)

Thanks in advance!

If your function set is unknown at compile time (for example, if a function is defined by user input in an edit box), then the only option I can see for you is to generate the kernel source and compile your kernel dynamically. The 3.2 API allows you to do that with cuModuleLoadDataEx.
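A minimal, untested sketch of that approach (the file names and the user-supplied fitness expression are made up for illustration; it assumes nvcc is installed and on the PATH of the target machine, and loads the generated PTX from a file rather than via cuModuleLoadDataEx on an in-memory image):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda.h>

int main(void)
{
    // Write out a kernel whose body comes from user input (hard-coded here).
    FILE *f = fopen("gen_kernel.cu", "w");
    fprintf(f, "extern \"C\" __global__ void fitness(float *x, int n)\n"
               "{\n"
               "    int i = blockIdx.x * blockDim.x + threadIdx.x;\n"
               "    if (i < n) x[i] = sqrtf(x[i]);  /* user-supplied expression */\n"
               "}\n");
    fclose(f);

    // Shell out to nvcc to produce PTX.
    system("nvcc -ptx -arch=sm_20 gen_kernel.cu -o gen_kernel.ptx");

    // Load the compiled module and retrieve the kernel handle.
    CUcontext ctx;
    CUmodule  mod;
    CUfunction fun;
    cuInit(0);
    cuCtxCreate(&ctx, 0, 0);
    cuModuleLoad(&mod, "gen_kernel.ptx");
    cuModuleGetFunction(&fun, mod, "fitness");
    // ... set up arguments and launch as usual ...
    cuCtxDestroy(ctx);
    return 0;
}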

Is it possible to set up a function pointer table dynamically from the host? For example, if I later want to implement other op_func functions, how would you register those functions dynamically from the host:

__device__ op_func func[MAX_NUM_OPS];

How should add_func, mul_func, and div_func be properly registered from the host?

I don’t think so (or at least I have not been able to get it to work). Using the driver API, it doesn’t seem possible to call cuModuleGetFunction with a device function name and retrieve a function handle. Looking inside a cubin file, only global functions seem to get written into the ELF index, even though device functions are visible in the symbol table. I am guessing that unless an object has an index entry in the cubin, the driver API does not expose any way of getting a handle to it.

Maybe not automatically, but can’t you manually keep a pointer to device functions in a device variable and then manipulate it from the host as desired?

Occam’s Razor in action. That does work just fine. Thanks for the suggestion.

Could you, please, provide an example? I believe I tried this to no avail.

I’m thinking something like this (although I haven’t actually tested it):

__device__ int func1(int par1)
{
    ...
}

__device__ int func2(int par1)
{
    ...
}

__device__ int func3(int par1)
{
    ...
}

typedef int (*func_ptr)(int);

__constant__ func_ptr func_table_d[3] = { func1, func2, func3 };

func_ptr func_table[3];

__global__ void testkernel(func_ptr func)
{
   ...
}

int main(void)
{
    int n = 1;
    ...
    cudaMemcpyFromSymbol(func_table, "func_table_d", sizeof(func_table), 0, cudaMemcpyDeviceToHost);

    testkernel<<<12, 34>>>(func_table[n]);
    ...
}

Caveat: Completely untested!

I was able to prove it to myself using this hacked version of Norbert’s example:

#include <stdio.h>
#include <stdlib.h>

#define N 5

__device__ float add_func (float x, float y)
{
    return x + y;
}

__device__ float mul_func (float x, float y)
{
    return x * y;
}

__device__ float div_func (float x, float y)
{
    return x / y;
}

typedef float (*op_func) (float, float);

__device__ op_func action;

__device__ op_func funcs[3] = { add_func, mul_func, div_func };

__device__ void op_array (const float *a, const float *b, float *res, op_func f, int n)
{
    for (int i = 0; i < n; i++) {
        res[i] = f(a[i], b[i]);
    }
}

__global__ void kernel (void)
{
    float x[N];
    float y[N];
    float res[N];

    for (int i = 0; i < N; i++) {
        x[i] = (float)(10 + i);
    }
    for (int i = 0; i < N; i++) {
        y[i] = (float)(100 + i);
    }
    op_array (x, y, res, action, N);
    for (int i = 0; i < N; i++) {
        printf ("res = % 16.9e\n", res[i]);
    }
}
And the host-side driver program (funcpointmain.c):

#include <stdio.h>
#include <cuda.h>

int main(void)
{
    CUmodule    mhandle;
    CUcontext   chandle;
    CUfunction  khandle;
    CUdeviceptr funcshandle, actionhandle;
    size_t op_funcsz;

    cuInit(0);
    cuCtxCreate( &chandle, 0, 0 );
    cuModuleLoad( &mhandle, "funcpointer.cubin");
    cuModuleGetFunction( &khandle, mhandle, "_Z6kernelv");
    cuModuleGetGlobal( &funcshandle, NULL, mhandle, "funcs");
    cuModuleGetGlobal( &actionhandle, &op_funcsz, mhandle, "action");

    for(int i=0; i<3; i++) {
        size_t offset = size_t(i) * op_funcsz;
        cuMemcpyDtoD(actionhandle, funcshandle+offset, op_funcsz);
        cuFuncSetBlockShape( khandle, 1, 1, 1 );
        cuLaunchGrid( khandle, 1, 1);
        cuCtxSynchronize();
    }

    cuCtxDestroy(chandle);
    return 0;
}
avidday@cuda:~$ nvcc -cubin -arch=sm_20 -Xptxas="-v" funcpointer.cu
ptxas info    : Compiling entry function '_Z6kernelv' for 'sm_20'
ptxas info    : Used 22 registers, 32 bytes cmem[0], 28 bytes cmem[14]
avidday@cuda:~$ g++ funcpointmain.c -I $CUDA_INSTALL_PATH/include -lcuda -o funcpointer.exe
avidday@cuda:~$ ./funcpointer.exe
res =  1.100000000e+02
res =  1.120000000e+02
res =  1.140000000e+02
res =  1.160000000e+02
res =  1.180000000e+02
res =  1.000000000e+03
res =  1.111000000e+03
res =  1.224000000e+03
res =  1.339000000e+03
res =  1.456000000e+03
res =  1.000000015e-01
res =  1.089108884e-01
res =  1.176470593e-01
res =  1.262135953e-01
res =  1.346153915e-01

This works, yes.

But it’s slightly different from the problem I had in mind.

Some pseudocode is below.

On device:

typedef float (*op_func) (float, float);

struct foo
{
   float a, b;
   op_func op;
};

__device__ foo* foo_array;

__global__ void kernel( ... )
{
    float val = ( *foo_array[i].op )( .... );
}

On host:

// Allocate a host-side staging copy of foo_array
foo* foo_array_h = (foo*) malloc( sizeof( foo ) * N );

for( i = 0; i < N; i++ ) {
     foo_array_h[i].a = ..
     foo_array_h[i].b = ..
     foo_array_h[i].op = ??????
}

// Copy foo_array_h to the device to be used in the kernel

Since you cannot take the address of a device function on the host, I can’t set the op function pointer. It seems that this cannot be done at all given current CUDA restrictions.

One way around it is to create a table of function pointers on the device and store an index into this table instead of the pointer itself:

struct foo
{
   float a, b;
   int op;
};

__device__ foo* foo_array;

__device__ op_func op_array[] = { add_op, mul_op, ... };

__global__ void kernel( ... )
{
    float val = ( *op_array[foo_array[i].op] )( .... );
}

This works, but it’s not very flexible, and it requires maintaining a table of all ops, which becomes problematic if other users (who don’t necessarily have access to the op table) are allowed to write their own op_func and extend the table.

Suggestions?

But you can, and the code I posted does exactly that:

cuModuleGetGlobal( &funcshandle, NULL, mhandle, "funcs");

This gets a pointer to the array of function pointers on the device side. You should be able to do the assignment you want using the same idea: read the function pointer values from a device symbol into host memory, assign them to the struct fields there, and copy the result back into device memory. My code uses a device-to-device copy instead, but the idea is the same.
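A completely untested sketch of that idea, building on the earlier driver-API example (it assumes the module declares the funcs table from that example and a device-side buffer named foo_array rather than a pointer, and that the host struct layout matches the device struct; fill_foo_array and foo_h are names invented for illustration):

```cuda
#include <cstdlib>
#include <cstring>
#include <cuda.h>

// Host-side mirror of the device struct { float a, b; op_func op; }.
// The device function pointer is opaque on the host; hold its raw bytes.
struct foo_h {
    float a, b;
    void *op;
};

// Read the function pointer values stored in the device symbol "funcs",
// patch them into a host staging array, then copy the populated array
// into the device-side buffer "foo_array".
void fill_foo_array(CUmodule mhandle, int n)
{
    CUdeviceptr funcshandle, foohandle;
    size_t funcs_sz;

    cuModuleGetGlobal(&funcshandle, &funcs_sz, mhandle, "funcs");
    size_t ptrsz = funcs_sz / 3;              // "funcs" holds 3 pointers

    unsigned char ops[3 * sizeof(void *)];    // raw pointer bytes
    cuMemcpyDtoH(ops, funcshandle, funcs_sz);

    foo_h *staging = (foo_h *) malloc(n * sizeof(foo_h));
    for (int i = 0; i < n; i++) {
        staging[i].a = (float) i;             // placeholder data
        staging[i].b = (float) i;
        memcpy(&staging[i].op, ops + (i % 3) * ptrsz, ptrsz);
    }

    cuModuleGetGlobal(&foohandle, NULL, mhandle, "foo_array");
    cuMemcpyHtoD(foohandle, staging, n * sizeof(foo_h));
    free(staging);
}
```

The per-element memcpy avoids assuming how wide a device function pointer is on any particular platform; the size is taken from what cuModuleGetGlobal reports for the funcs symbol.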