function pointers

Is there a possibility of CUDA supporting device function pointers within global functions in the future? Is there an alternative?

I have a global function that needs to call a device function in its innermost loop, but in different calls to the global function I want a different device function to be called inside that loop. Since function pointers aren’t supported for device functions, I would need an if statement in the innermost part of the loop to decide which function to call, which would be unnecessarily slow. The only other alternative, which is not an attractive one, is to duplicate a rather large global function once for each device function I want to support.

I’m speaking in very general terms, so below I’ve provided the particular code I’m referring to.

__device__ float fourier_transform_sph(float r)
{
    if(r >= 0.06f)
    {
        float m = pi*r;
        float _2_m = 2.0f*m;
        float cos_2_m, sin_2_m;
        __sincosf(_2_m, &sin_2_m, &cos_2_m);
        return ((2.3561944901923448f)/(m*m*m*m*m*m))*(cos_2_m-1.0f)*(cos_2_m+m*sin_2_m-1.0f);
    }
    return (pi + r*(-0.007968913156311f + r*(-18.293608272337678f)));
}

__device__ float fourier_transform_wendland_d3_c2(float r)
{
    if(r >= 0.24f)
    {
        float m = pi*r;
        float cos_2_m, sin_2_m;
        __sincosf(2.0f*m, &sin_2_m, &cos_2_m);
        float m_2 = m*m;
        float m_4 = m_2*m_2;
        return ((pi*7.5f)/(m_4*m_4))*(4.0f*m_2 - 6.0f + (6.0f-m_2)*cos_2_m+4.5f*m*sin_2_m);
    }
    return 0.299199300341890f + r*(-0.002379178586124f + r*(-0.370530218163545f));
}

__global__ void fourier_transform_interpolant_half_domain(int first_term, int number_of_terms, MeshlessInterpolant meshless_interpolant, FVRConfig fvr_config)
{
    extern __shared__ float shared[];
    int block_area = fvr_config._block_area;
    float3* ds_positions = (float3*)shared;
    float* ds_weights = (float*)(ds_positions + block_area);
    float* ds_radii = ds_weights + block_area;
    int x = (blockDim.x*blockIdx.x + threadIdx.x);
    int y = (blockDim.y*blockIdx.y + threadIdx.y);
    int index_x = x;
    if(x > fvr_config.cutoff_frequency.x)
    {
        x = x-2*fvr_config.cutoff_frequency.x;
        index_x = (x+fvr_config.number_of_samples.x);
    }
    int index = index_x*(fvr_config.number_of_samples.y/2 + 1) + y;
    float fu = fvr_config.step_size.x*x, fv = fvr_config.step_size.y*y;

    float3 f_coord = make_float3(fu*fvr_config.u_axis.x + fv*fvr_config.v_axis.x, fu*fvr_config.u_axis.y + fv*fvr_config.v_axis.y, fu*fvr_config.u_axis.z + fv*fvr_config.v_axis.z);
    float r = sqrtf((float)(fu*fu + fv*fv));
    float2 sum = make_float2(0.0f, 0.0f);
    float sin_v, cos_v;
    int thread_index = blockDim.x*threadIdx.y + threadIdx.x;
    for(unsigned int k = first_term; k < number_of_terms; )
    {
        if (k+thread_index < number_of_terms)
        {
            ds_positions[thread_index] = meshless_interpolant.d_positions[k + thread_index];
            ds_weights[thread_index] = meshless_interpolant.d_weights[k + thread_index];
            ds_radii[thread_index] = meshless_interpolant.d_radii[k + thread_index];
        }
        __syncthreads();
        for(int j = 0; j != block_area && k < number_of_terms; ++j, ++k)
        {
            float term = ds_weights[j]*fourier_transform_sph(r*ds_radii[j]); // HERE is where I would like to call an arbitrary device function, but I would like to decide which one outside of this double loop
            __sincosf(_2pi*dot(f_coord, ds_positions[j]), &sin_v, &cos_v);
            sum.x += term*cos_v;
            sum.y += term*sin_v;
        }
        __syncthreads();
    }
    float2 prev = fvr_config._d_half_image[index];
    fvr_config._d_half_image[index].x = prev.x + sum.x;
    fvr_config._d_half_image[index].y = prev.y - sum.y;
}

Also, if you happen to have looked at this code and can offer some performance hints, they are very much appreciated.

Unfortunately function pointers are not supported on current hardware.

It seems like your problem is just a question of programming convenience - you don’t actually need to choose which function to call at runtime. Could you use C++ templates or just a macro instead?

It’s worth noting that function calls are always inlined currently anyway.
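For instance, the macro route could look something like this (a minimal sketch with hypothetical names and a toy kernel body; your real kernel body would go inside the macro):

// Toy device functions standing in for the real Fourier-transform kernels.
__device__ float ft_a(float r) { return r*r; }
__device__ float ft_b(float r) { return 1.0f/(1.0f + r*r); }

// The kernel body is written once and stamped out per device function.
#define DEFINE_FT_KERNEL(NAME, FT_FUNC)                        \
__global__ void NAME(const float* in, float* out, int n)      \
{                                                              \
    int i = blockDim.x*blockIdx.x + threadIdx.x;               \
    if (i < n)                                                 \
        out[i] = FT_FUNC(in[i]);                               \
}

DEFINE_FT_KERNEL(ft_kernel_a, ft_a)
DEFINE_FT_KERNEL(ft_kernel_b, ft_b)

Each expansion gives you a separate __global__ function, so the choice of device function is made entirely at compile time.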

Thanks. I didn’t realize it would be a hardware issue, and I didn’t think templates were possible; I’m giving that a shot. Also, I had totally forgotten about macros, and that is an acceptable alternative as well.

Templates don’t seem to be working; I get this error message (CUDA 0.9 on 64-bit Linux):

nvcc  -o obj/release/meshless_fvr.cu_o -c meshless_fvr.cu -I/home/acorrigan/cuda/include -I/home/acorrigan/cuda/sdk/common/inc -I. -I../include -I/usr/include -DUNIX -O3

Signal: Segmentation fault in Codegen Driver phase.

Error: Signal Segmentation fault in phase Codegen Driver -- processing aborted

*** Internal stack backtrace:

    /home/acorrigan/cuda/open64/lib//be [0x6c37c3]

    /home/acorrigan/cuda/open64/lib//be [0x6c3846]

    /home/acorrigan/cuda/open64/lib//be [0x6c46e2]

    /home/acorrigan/cuda/open64/lib//be [0x6c47ec]

    /home/acorrigan/cuda/open64/lib//be [0x6c4949]

    /home/acorrigan/cuda/open64/lib//be [0x6c2ff5]

    /lib/libc.so.6 [0x2b7265c6dd40]

    /home/acorrigan/cuda/open64/lib//be [0x41bc2a]

    /home/acorrigan/cuda/open64/lib//be [0x41bac8]

    /home/acorrigan/cuda/open64/lib//be [0x41b971]

    /home/acorrigan/cuda/open64/lib//be [0x41b873]

    /home/acorrigan/cuda/open64/lib//be [0x41b6a3]

    /home/acorrigan/cuda/open64/lib//be [0x6d6bde]

    /home/acorrigan/cuda/open64/lib//be [0x6d6f10]

    /home/acorrigan/cuda/open64/lib//be [0x6d7ee4]

    /home/acorrigan/cuda/open64/lib//be [0x6dc1d4]

    /home/acorrigan/cuda/open64/lib//be [0x6dbdfc]

    /home/acorrigan/cuda/open64/lib//be [0x6d2d85]

    /home/acorrigan/cuda/open64/lib//be [0x6d6660]

    /home/acorrigan/cuda/open64/lib//be [0x6d60ba]

    /home/acorrigan/cuda/open64/lib//be [0x6d6b98]

    /home/acorrigan/cuda/open64/lib//be [0x6d7eab]

    /home/acorrigan/cuda/open64/lib//be [0x6dc1d4]

    /home/acorrigan/cuda/open64/lib//be [0x6dbdfc]

    /home/acorrigan/cuda/open64/lib//be [0x6da4cc]

    /home/acorrigan/cuda/open64/lib//be [0x6dbf2f]

    /home/acorrigan/cuda/open64/lib//be [0x6dc1b4]

    /home/acorrigan/cuda/open64/lib//be [0x6dbdfc]

    /home/acorrigan/cuda/open64/lib//be [0x6da4cc]

    /home/acorrigan/cuda/open64/lib//be [0x6dbf2f]

    /home/acorrigan/cuda/open64/lib//be [0x6dc1b4]

nvopencc INTERNAL ERROR: /home/acorrigan/cuda/open64/lib//be died due to signal 4

Is this a bug, or is my usage of templates not supported? I tried to use templates in the following way:

template <typename FourierTransformKernel>
__global__ void fourier_transform_interpolant_half_domain(FourierTransformKernel fourier_transform_kernel, int first_term, int number_of_terms, MeshlessInterpolant meshless_interpolant, FVRConfig fvr_config)

Here I replace the call to fourier_transform_sph with fourier_transform_kernel, and fourier_transform_sph is passed as a parameter to fourier_transform_interpolant_half_domain.
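Concretely, the launch I attempted looks roughly like this (grid, threads and shared_bytes stand in for my actual launch configuration):

fourier_transform_interpolant_half_domain<<<grid, threads, shared_bytes>>>(
    fourier_transform_sph, first_term, number_of_terms, meshless_interpolant, fvr_config);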

I basically tried to pass a function in the same way I do in this little C++ code:

#include <iostream>

template<typename func>
void call(func f)
{
    std::cout << f() << std::endl;
}

int f() { return 3; }

int main(int argc, char** argv)
{
    call(f);
    return 0;
}

This usage of templates worked in C++ with gcc. Will it be supported by CUDA in the near future, if at all? If not, is there another way to use templates that would achieve the same thing? I actually looked into using macros, and it seems so complicated that I’d rather just generate the code with Python.

Templates are not supported, as CUDA code is supposed to be C code.

Before the code is actually compiled with nvopencc, a couple of preprocessing steps take place (watch them with nvcc -v). These steps use cl or gcc to do the preprocessing. So you can use template notation only if the template can be fully specialized by the preprocessor.

Function template arguments cannot be resolved in the preprocessor, as the function has a runtime-dependent type. You can, however, use POD types and their values, and then rely on the dead code elimination phase later in the process to glue together only the pieces you want.

Example: device code

__device__ float calculateMagic()
{
    // ... magic code ...
    return 42.0f;   // placeholder so the sketch compiles
}

template <bool needMagic>
__global__ void myKernel(float* data)
{
    float myData = data[threadIdx.x];
    if (needMagic)   // compile time decision!
        myData += calculateMagic();
    data[threadIdx.x] = myData;   // write the result back
}

Host code

// data, grid and threads are assumed to be defined elsewhere
void runKernel(bool magicNeeded)
{
    if (magicNeeded)
        myKernel<true><<<grid,threads>>>(data);
    else
        myKernel<false><<<grid,threads>>>(data);
}

This will actually generate two specialized kernels for you. You can extend this idea to several template parameters; the compiler will then generate all the combinations needed, each representing an optimized kernel. Neat. Hope this helps.
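For instance, extended to two compile-time flags it could look like this (a minimal sketch; myKernel2, needScaling and the runKernel2 parameters are hypothetical names):

// Hypothetical two-flag extension: the host-side branching selects one of
// four fully specialized kernels, and the device-side ifs compile away.
template <bool needMagic, bool needScaling>
__global__ void myKernel2(float* data, float scale)
{
    float v = data[threadIdx.x];
    if (needMagic)      // compile time decision
        v += calculateMagic();
    if (needScaling)    // compile time decision
        v *= scale;
    data[threadIdx.x] = v;
}

void runKernel2(bool magic, bool scaling, float* data, float scale,
                dim3 grid, dim3 threads)
{
    if (magic && scaling)   myKernel2<true,  true ><<<grid,threads>>>(data, scale);
    else if (magic)         myKernel2<true,  false><<<grid,threads>>>(data, scale);
    else if (scaling)       myKernel2<false, true ><<<grid,threads>>>(data, scale);
    else                    myKernel2<false, false><<<grid,threads>>>(data, scale);
}

Each of the four instantiations is a fully specialized kernel with the dead branches removed.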

Peter

That looks pretty good. Thanks a lot!

If I could have the if-statement only once, that’d be better in terms of writing more concise code. But the main thing is that the innermost if-statement gets compiled away.

The if statement for the templated kernel call is in host code :)

That is, there you can use function tables: write an init function that does the if and stores the obtained function pointer (a kernel factory, so to speak).

Peter

Thanks again. I understand the first part of your post, but I don’t understand the part about function tables and the kernel factory.

Is it possible to illustrate what you mean? I don’t understand how this init function would store a device function pointer.

I haven’t tried to store a device function pointer directly yet, but you can always store a function pointer to an (inline) host wrapper.

// grid and threads are assumed to be defined elsewhere
void magicKernel(float* data)
{
    myKernel<true><<<grid,threads>>>(data);
}

void basicKernel(float* data)
{
    myKernel<false><<<grid,threads>>>(data);
}

// table of host-side wrappers, indexed by the runtime decision
void (*kernels[2])(float* data);

void init()
{
    kernels[0] = basicKernel;
    kernels[1] = magicKernel;
}

Then call kernels[bool].
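For example (magicNeeded and d_data are just placeholders for the runtime flag and the actual kernel argument):

init();
kernels[magicNeeded](d_data);   // the bool indexes the table: 0 = basic, 1 = magic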

Peter