Passing a function pointer

Hi guys

I need to do something simple like this:

template<class T>
__global__ void fill_func(T *base,T (func*)(int),int width) {
	int i=threadIdx.x + blockDim.x * blockIdx.x;
	if (i<width) base[i]=func(i);
	__syncthreads();
}

At compile time nvcc returns:
[blabla]/…cuda_utils.cuh(57): error: identifier "func" is undefined
[blabla]/…cuda_utils.cuh(57): error: function returning function is not allowed

My intention is to pass a function pointer so as to fill the array with func’s results, like this:

__device__ float increment_func(int i) {
	return (float)(i);
}

fill_func<<<SIZE,SIZE/256>>>(array_base,increment_func,SIZE);
Does CUDA 11.2 support this?
Thanks

Yes, with some modifications.

You might start with a pattern that would actually compile in C++ (for this exercise, write the functions as you would if they were all non-CUDA, host-code functions). For example, in C++ a function-pointer parameter is declared (*func)(), not (func*)(). Fixing that gets rid of the mistakes that are producing the errors you indicate. If you did that, you might end up with something like this, which compiles without errors:

template <class T>
__global__ void fill_func(T *base,T (*func)(int),int width) {
        int i=threadIdx.x + blockDim.x * blockIdx.x;
        if (i<width) base[i]=func(i);
        __syncthreads();
}

__device__ float increment_func(int i) {
        return (float)(i);
}

int main(){
  const int SIZE = 256;
  float *array_base = NULL;
  fill_func<<<SIZE,SIZE/256>>>(array_base,increment_func,SIZE);
}

However, this still wouldn’t work correctly. There is one additional issue to deal with: in CUDA, __device__ addresses cannot be (directly) captured in host code. Therefore this:

  fill_func<<<SIZE,SIZE/256>>>(array_base,increment_func,SIZE);
                                          ^^^^^^^^^^^^^^

isn’t going to do what you might think. The topic of function pointer usage is covered in many places, including the programming guide, some CUDA sample codes, and various forum questions such as here. From among the many ways to work around this issue linked from that question, I’ll pick one:

$ cat t48.cu
#include <cstdio>

template <class T>
__global__ void fill_func(T *base,T (*func)(int),int width) {
        int i=threadIdx.x + blockDim.x * blockIdx.x;
        if (i<width) base[i]=func(i);
        __syncthreads();
        if (i < 2) printf("thread: %d  : %f\n", i, base[i]);
}

__device__ float  increment_func(int i) {
return (float)(i);
}
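// capture the device-side address of increment_func in a __device__ variable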
__device__ float (*d_increment_func)(int) = increment_func;

int main(){
  const int SIZE = 256;
  float *array_base;
  cudaMalloc(&array_base, SIZE*sizeof(array_base[0]));
  float (*h_d_increment_func)(int);
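  // copy the device-side function pointer value from the __device__ variable to the host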
  cudaMemcpyFromSymbol(&h_d_increment_func, d_increment_func, sizeof (h_d_increment_func));
  fill_func<<<SIZE,SIZE/256>>>(array_base,h_d_increment_func,SIZE);
  cudaDeviceSynchronize();
}
$ nvcc -o t48 t48.cu
$ cuda-memcheck ./t48
========= CUDA-MEMCHECK
thread: 0  : 0.000000
thread: 1  : 1.000000
========= ERROR SUMMARY: 0 errors
$
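As an aside, if you are able to restructure the code a bit, a common way to sidestep device function pointers entirely is to pass a functor as a template argument; the functor object is copied to the device as an ordinary kernel argument and the call can be inlined by the compiler. Here is a minimal sketch of that pattern (the names fill_functor and Increment are just illustrative, not from the code above):

template <class T, class F>
__global__ void fill_functor(T *base, F func, int width) {
        int i=threadIdx.x + blockDim.x * blockIdx.x;
        // call the functor's __device__ operator() for each element
        if (i<width) base[i]=func(i);
}

struct Increment {
        __device__ float operator()(int i) const { return (float)(i); }
};

// usage, with array_base and SIZE set up as in the example above:
// fill_functor<<<SIZE,SIZE/256>>>(array_base, Increment(), SIZE);

A __device__ lambda can be used in a similar way if you compile with the --extended-lambda flag (available in CUDA 11.2), but the functor version above should work without any extra flags.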

Thank you,
very kind of you for your useful answer