function pointers

Is there a possibility of CUDA supporting device function pointers within global functions in the future? Is there an alternative?

I have a global function that needs to call a device function in its innermost loop, but in different calls to the global function I want a different device function to be called inside that loop. Since function pointers aren’t supported for device functions, I would need an if statement in the innermost part of the loop to decide which function to call, which would be unnecessarily slow. The only other alternative, which is not an attractive one, is to duplicate a rather large global function once for each device function I want to support.

I’m speaking in very general terms, so below I’ve provided the particular code I’m referring to.

__device__ float fourier_transform_sph(float r)
{
    if(r >= 0.06f)
    {
        float m = pi*r;
        float _2_m = 2.0f*m;
        float cos_2_m, sin_2_m;
        __sincosf(_2_m, &sin_2_m, &cos_2_m);
        return ((2.3561944901923448f)/(m*m*m*m*m*m))*(cos_2_m-1.0f)*(cos_2_m+m*sin_2_m-1.0f);
    }
    return (pi + r*(-0.007968913156311f + r*(-18.293608272337678f)));
}

__device__ float fourier_transform_wendland_d3_c2(float r)
{
    if(r >= 0.24f)
    {
        float m = pi*r;
        float cos_2_m, sin_2_m;
        __sincosf(2.0f*m, &sin_2_m, &cos_2_m);
        float m_2 = m*m;
        float m_4 = m_2*m_2;
        return ((pi*7.5f)/(m_4*m_4))*(4.0f*m_2 - 6.0f + (6.0f-m_2)*cos_2_m+4.5f*m*sin_2_m);
    }
    return 0.299199300341890f + r*(-0.002379178586124f + r*(-0.370530218163545f));
}

__global__ void fourier_transform_interpolant_half_domain(int first_term, int number_of_terms, MeshlessInterpolant meshless_interpolant, FVRConfig fvr_config)
{
    extern __shared__ float shared[];
    int block_area = fvr_config._block_area;
    float3* ds_positions = (float3*)shared;
    float* ds_weights = (float*)(ds_positions + block_area);
    float* ds_radii = ds_weights + block_area;
    int x = (blockDim.x*blockIdx.x + threadIdx.x);
    int y = (blockDim.y*blockIdx.y + threadIdx.y);
    int index_x = x;
    if(x > fvr_config.cutoff_frequency.x)
    {
        x = x-2*fvr_config.cutoff_frequency.x;
        index_x = (x+fvr_config.number_of_samples.x);
    }
    int index = index_x*(fvr_config.number_of_samples.y/2 + 1) + y;
    float fu = fvr_config.step_size.x*x, fv = fvr_config.step_size.y*y;

    float3 f_coord = make_float3(fu*fvr_config.u_axis.x + fv*fvr_config.v_axis.x, fu*fvr_config.u_axis.y + fv*fvr_config.v_axis.y, fu*fvr_config.u_axis.z + fv*fvr_config.v_axis.z);
    float r = sqrtf((float)(fu*fu + fv*fv));
    float2 sum = make_float2(0.0f, 0.0f);
    float sin_v, cos_v;
    int thread_index = blockDim.x*threadIdx.y + threadIdx.x;
    for(unsigned int k = first_term; k < number_of_terms; )
    {
        if (k+thread_index < number_of_terms)
        {
            ds_positions[thread_index] = meshless_interpolant.d_positions[k + thread_index];
            ds_weights[thread_index] = meshless_interpolant.d_weights[k + thread_index];
            ds_radii[thread_index] = meshless_interpolant.d_radii[k + thread_index];
        }
        __syncthreads();
        for(int j = 0; j != block_area && k < number_of_terms; ++j, ++k)
        {
            float term = ds_weights[j]*fourier_transform_sph(r*ds_radii[j]); // HERE is where I would like to call an arbitrary device function, but I would like to decide which one outside of this double loop
            __sincosf(_2pi*dot(f_coord, ds_positions[j]), &sin_v, &cos_v);
            sum.x += term*cos_v;
            sum.y += term*sin_v;
        }
        __syncthreads();
    }
    float2 prev = fvr_config._d_half_image[index];
    fvr_config._d_half_image[index].x = prev.x + sum.x;
    fvr_config._d_half_image[index].y = prev.y - sum.y;
}

Also, if you happen to have looked at this code and can offer some performance hints, they are very much appreciated.

Unfortunately function pointers are not supported on current hardware.

It seems like your problem is just a question of programming convenience - you don’t actually need to choose which function to call at runtime. Could you use C++ templates or just a macro instead?

It’s worth noting that function calls are always inlined currently anyway.
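For instance, the macro route could look something like this (a minimal sketch with hypothetical names and a toy kernel body; your real kernel body would go inside the macro):

// Toy device functions standing in for the real Fourier-transform kernels.
__device__ float ft_a(float r) { return r*r; }
__device__ float ft_b(float r) { return 1.0f/(1.0f + r*r); }

// The kernel body is written once and stamped out per device function.
#define DEFINE_FT_KERNEL(NAME, FT_FUNC)                        \
__global__ void NAME(const float* in, float* out, int n)      \
{                                                              \
    int i = blockDim.x*blockIdx.x + threadIdx.x;               \
    if (i < n)                                                 \
        out[i] = FT_FUNC(in[i]);                               \
}

DEFINE_FT_KERNEL(ft_kernel_a, ft_a)
DEFINE_FT_KERNEL(ft_kernel_b, ft_b)

Each expansion gives you a separate __global__ function, so the choice of device function is made entirely at compile time.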

Thanks. I didn’t realize it would be a hardware issue, and I didn’t think templates were possible; I’m giving that a shot. Also, I had totally forgotten about macros, and that is an acceptable alternative as well.

Templates don’t seem to be working; I get this error message (CUDA 0.9 on 64-bit Linux):

nvcc  -o obj/release/meshless_fvr.cu_o -c meshless_fvr.cu -I/home/acorrigan/cuda/include -I/home/acorrigan/cuda/sdk/common/inc -I. -I../include -I/usr/include -DUNIX -O3

Signal: Segmentation fault in Codegen Driver phase.

Error: Signal Segmentation fault in phase Codegen Driver -- processing aborted

*** Internal stack backtrace:

    /home/acorrigan/cuda/open64/lib//be [0x6c37c3]

    /home/acorrigan/cuda/open64/lib//be [0x6c3846]

    /home/acorrigan/cuda/open64/lib//be [0x6c46e2]

    /home/acorrigan/cuda/open64/lib//be [0x6c47ec]

    /home/acorrigan/cuda/open64/lib//be [0x6c4949]

    /home/acorrigan/cuda/open64/lib//be [0x6c2ff5]

    /lib/libc.so.6 [0x2b7265c6dd40]

    /home/acorrigan/cuda/open64/lib//be [0x41bc2a]

    /home/acorrigan/cuda/open64/lib//be [0x41bac8]

    /home/acorrigan/cuda/open64/lib//be [0x41b971]

    /home/acorrigan/cuda/open64/lib//be [0x41b873]

    /home/acorrigan/cuda/open64/lib//be [0x41b6a3]

    /home/acorrigan/cuda/open64/lib//be [0x6d6bde]

    /home/acorrigan/cuda/open64/lib//be [0x6d6f10]

    /home/acorrigan/cuda/open64/lib//be [0x6d7ee4]

    /home/acorrigan/cuda/open64/lib//be [0x6dc1d4]

    /home/acorrigan/cuda/open64/lib//be [0x6dbdfc]

    /home/acorrigan/cuda/open64/lib//be [0x6d2d85]

    /home/acorrigan/cuda/open64/lib//be [0x6d6660]

    /home/acorrigan/cuda/open64/lib//be [0x6d60ba]

    /home/acorrigan/cuda/open64/lib//be [0x6d6b98]

    /home/acorrigan/cuda/open64/lib//be [0x6d7eab]

    /home/acorrigan/cuda/open64/lib//be [0x6dc1d4]

    /home/acorrigan/cuda/open64/lib//be [0x6dbdfc]

    /home/acorrigan/cuda/open64/lib//be [0x6da4cc]

    /home/acorrigan/cuda/open64/lib//be [0x6dbf2f]

    /home/acorrigan/cuda/open64/lib//be [0x6dc1b4]

    /home/acorrigan/cuda/open64/lib//be [0x6dbdfc]

    /home/acorrigan/cuda/open64/lib//be [0x6da4cc]

    /home/acorrigan/cuda/open64/lib//be [0x6dbf2f]

    /home/acorrigan/cuda/open64/lib//be [0x6dc1b4]

nvopencc INTERNAL ERROR: /home/acorrigan/cuda/open64/lib//be died due to signal 4

Is this a bug, or is my usage of templates not supported? I tried to use templates in the following way:

template <typename FourierTransformKernel>
__global__ void fourier_transform_interpolant_half_domain(FourierTransformKernel fourier_transform_kernel, int first_term, int number_of_terms, MeshlessInterpolant meshless_interpolant, FVRConfig fvr_config)

Here I replace the call to fourier_transform_sph with fourier_transform_kernel, and fourier_transform_sph is passed as a parameter to fourier_transform_interpolant_half_domain.
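Concretely, the launch I attempted looks roughly like this (grid, threads and shared_bytes stand in for my actual launch configuration):

fourier_transform_interpolant_half_domain<<<grid, threads, shared_bytes>>>(
    fourier_transform_sph, first_term, number_of_terms, meshless_interpolant, fvr_config);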

I basically tried to pass a function in the same way I do in this little C++ code:

#include <iostream>

template<typename func>
void call(func f)
{
    std::cout << f() << std::endl;
}

int f() { return 3; }

int main(int argc, char** argv)
{
    call(f);
    return 0;
}

This usage of templates worked in C++ with gcc. Will it be supported by CUDA in the near future, if at all? If not, is there another way to use templates that would achieve the same thing? I actually looked into using macros, and it seems so complicated that I’d rather just generate the code with Python.

Templates are not supported, as CUDA code is supposed to be C code.

Before the code is actually compiled with nvopencc, a couple of preprocessing steps take place (watch them with nvcc -v). These steps use cl or gcc to do the preprocessing. So you can use template notation only if the template can be fully specialized by the preprocessor.

Function template arguments cannot be resolved in the preprocessor, as the function has a runtime-dependent type. You can, however, use POD types and their values, and then rely on the dead code elimination phase later in the process to glue together only the pieces you want.

Example: device code

__device__ float calculateMagic()
{
    // ... magic code ...
    return 42.0f;   // placeholder so the sketch compiles
}

template <bool needMagic>
__global__ void myKernel(float* data)
{
    float myData = data[threadIdx.x];
    if (needMagic)   // compile time decision!
        myData += calculateMagic();
    data[threadIdx.x] = myData;   // write the result back
}

Host code

// data, grid and threads are assumed to be defined elsewhere
void runKernel(bool magicNeeded)
{
    if (magicNeeded)
        myKernel<true><<<grid,threads>>>(data);
    else
        myKernel<false><<<grid,threads>>>(data);
}

This will actually generate two specialized kernels for you. You can extend this idea to several template parameters; the compiler will then generate all the combinations needed, each representing an optimized kernel. Neat. Hope this helps.
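For instance, extended to two compile-time flags it could look like this (a minimal sketch; myKernel2, needScaling and the runKernel2 parameters are hypothetical names):

// Hypothetical two-flag extension: the host-side branching selects one of
// four fully specialized kernels, and the device-side ifs compile away.
template <bool needMagic, bool needScaling>
__global__ void myKernel2(float* data, float scale)
{
    float v = data[threadIdx.x];
    if (needMagic)      // compile time decision
        v += calculateMagic();
    if (needScaling)    // compile time decision
        v *= scale;
    data[threadIdx.x] = v;
}

void runKernel2(bool magic, bool scaling, float* data, float scale,
                dim3 grid, dim3 threads)
{
    if (magic && scaling)   myKernel2<true,  true ><<<grid,threads>>>(data, scale);
    else if (magic)         myKernel2<true,  false><<<grid,threads>>>(data, scale);
    else if (scaling)       myKernel2<false, true ><<<grid,threads>>>(data, scale);
    else                    myKernel2<false, false><<<grid,threads>>>(data, scale);
}

Each of the four instantiations is a fully specialized kernel with the dead branches removed.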

Peter

That looks pretty good. Thanks a lot!

If I could have the if-statement only once, that’d be better in terms of writing more concise code. But the main thing is that the innermost if-statement gets compiled away.

The if statement for the templated kernel call is in host code :)

That is, there you can use function tables: write an init function that does the if and stores the obtained function pointer (a kernel factory, so to speak).

Peter

Thanks again. I understand the first part of your post, but I don’t understand the part about function tables and the kernel factory.

Is it possible to illustrate what you mean? I don’t understand how this init function would store a device function pointer.

I haven’t tried to store a device function pointer directly yet, but you can always store a function pointer to an (inline) host wrapper.

// grid and threads are assumed to be defined elsewhere
void magicKernel(float* data)
{
    myKernel<true><<<grid,threads>>>(data);
}

void basicKernel(float* data)
{
    myKernel<false><<<grid,threads>>>(data);
}

// table of host-side wrappers, indexed by the runtime decision
void (*kernels[2])(float* data);

void init()
{
    kernels[0] = basicKernel;
    kernels[1] = magicKernel;
}

Then call kernels[bool].
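For example (magicNeeded and d_data are just placeholders for the runtime flag and the actual kernel argument):

init();
kernels[magicNeeded](d_data);   // the bool indexes the table: 0 = basic, 1 = magic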

Peter