__device__ function pointer: speed loss?

Hi all,

I have a “problem” with pointers to device functions. When I call a device function through a pointer instead of calling it directly, I lose about 40% of the speed.

So my question is: is this speed drop expected, or did I do something wrong?

Another possible problem is that the device function pointer is stored in host memory, using a typedef. Here is some of the code:

__device__ double DevTestfkt2(double x)
{
    return (x * sin(3 * log(x + x * x + 1)));
}

typedef double (*op_func_t) (double);

__device__ op_func_t GlobDevFunktion2Int = DevTestfkt2;

Then I copy the pointer during class initialization, where the class holds an op_func_t member named DevFunction2Int, with this code:

cudaMemcpyFromSymbol(&(this->DevFunction2Int), GlobDevFunktion2Int, sizeof(op_func_t));
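To make the pointer round trip easier to follow, here is a minimal sketch of the same setup outside of a class. EvalThroughPointer and EvalOnDevice are made-up names for illustration and are not part of the attached files; the integrand, the typedef and the __device__ pointer variable are the ones from this post.

```cuda
#include <cmath>
#include <cuda_runtime.h>

typedef double (*op_func_t)(double);

// Integrand from the post.
__device__ double DevTestfkt2(double x)
{
    return x * sin(3 * log(x + x * x + 1));
}

// Device-side variable that holds the address of the device function.
__device__ op_func_t GlobDevFunktion2Int = DevTestfkt2;

// Hypothetical kernel: evaluates the function once through the pointer.
__global__ void EvalThroughPointer(op_func_t f, double x, double* out)
{
    *out = f(x);
}

// Hypothetical host helper: copies the pointer value to the host,
// passes it to the kernel as an argument and returns f(x).
// Returns NAN if any CUDA call fails.
double EvalOnDevice(double x)
{
    op_func_t hostCopy = nullptr;
    if (cudaMemcpyFromSymbol(&hostCopy, GlobDevFunktion2Int, sizeof(op_func_t)) != cudaSuccess)
        return NAN;

    double* devOut = nullptr;
    if (cudaMalloc(&devOut, sizeof(double)) != cudaSuccess)
        return NAN;

    EvalThroughPointer<<<1, 1>>>(hostCopy, x, devOut);

    double result = NAN;
    // cudaMemcpy waits for the kernel to finish before copying.
    cudaError_t err = cudaMemcpy(&result, devOut, sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(devOut);
    return (err == cudaSuccess) ? result : NAN;
}
```

The value stored on the host is only a device address; it is never called on the host, just handed back to the kernel.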

Then I launch the kernel with:

DevTrapez3 <<< Blocks , Threads, Threads*sizeof(double) >>> (Xstart , Xend , h , dev_Integral, dev_BlockSumme, this->DevFunction2Int);

where the kernel is:

__global__ void DevTrapez3(double a, double b, double h, double* Integral, double* BlockSummen, op_func_t DevTestfkt)
{
    if (h == b - a)
    {
        if (threadIdx.x == 0) *Integral = 0.5 * (b - a) * ((*DevTestfkt)(a) + (*DevTestfkt)(b));
    }
    else
    {
        int index = threadIdx.x + blockIdx.x * blockDim.x;
        extern __shared__ double Stutzstellen[];
        double x = a + h * (1 + index + (index) / 2); // index shift so that already computed sample points are reused
        if (x < b) {
            Stutzstellen[threadIdx.x] = (*DevTestfkt)(x);
        }
        else {
            Stutzstellen[threadIdx.x] = 0;
        }
        ........

So with this approach I lose about 40% compared to an inlined function. The full code for both the inline and the pointer method, together with their output for a timing comparison, is attached.
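For reference, a call through a function pointer is an indirect call, which the compiler generally cannot inline. One way to keep the integrand selectable while leaving the compiler free to inline it is to pass it as a template parameter. The following is only a sketch under that assumption: TestfktFunctor, EvalSamples and SumSamples are hypothetical names, the kernel is much simpler than DevTrapez3, and whether this recovers the 40% in the attached code has not been measured here.

```cuda
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical functor wrapping the integrand from the post.
struct TestfktFunctor
{
    __device__ double operator()(double x) const
    {
        return x * sin(3 * log(x + x * x + 1));
    }
};

// Hypothetical kernel: the integrand type F is known at compile time,
// so f(...) is a direct call that the compiler may inline.
template <typename F>
__global__ void EvalSamples(F f, double a, double h, int n, double* out)
{
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < n)
    {
        out[index] = f(a + h * index);
    }
}

// Hypothetical host helper: evaluates n samples a, a+h, ... on the device
// and returns their sum (summed on the host). Returns NAN on a CUDA error.
double SumSamples(double a, double h, int n)
{
    double* devOut = nullptr;
    if (cudaMalloc(&devOut, n * sizeof(double)) != cudaSuccess)
        return NAN;

    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    EvalSamples<<<blocks, threads>>>(TestfktFunctor(), a, h, n, devOut);

    std::vector<double> host(n);
    cudaError_t err = cudaMemcpy(host.data(), devOut, n * sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(devOut);
    if (err != cudaSuccess)
        return NAN;

    double sum = 0.0;
    for (int i = 0; i < n; ++i)
        sum += host[i];
    return sum;
}
```

The trade-off is that the integrand has to be known at compile time, so a separate kernel instance is generated for every functor type used.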

Thank you
IntTestPointer.cu (8.42 KB)
IntTestInline.cu (7.5 KB)
IntegrationsAusgabeInline.txt (1.11 KB)
IntegrationsAusgabePointer.txt (1.87 KB)

Do you get a similar speed difference if you use the same number of threads in both cases?
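One way to check this is to time both variants with exactly the same grid and block size, using CUDA events. A rough sketch, assuming two simplified kernels (EvalDirect and EvalPointer are made-up names, as are Timing and CompareLaunches; none of them come from the attached files):

```cuda
#include <cuda_runtime.h>

typedef double (*op_func_t)(double);

// Integrand and device pointer variable as in the post.
__device__ double DevTestfkt2(double x)
{
    return x * sin(3 * log(x + x * x + 1));
}
__device__ op_func_t GlobDevFunktion2Int = DevTestfkt2;

// Hypothetical kernel: calls the function directly.
__global__ void EvalDirect(double a, double h, int n, double* out)
{
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < n) out[index] = DevTestfkt2(a + h * index);
}

// Hypothetical kernel: same work, but through the function pointer.
__global__ void EvalPointer(op_func_t f, double a, double h, int n, double* out)
{
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < n) out[index] = f(a + h * index);
}

struct Timing { float directMs; float pointerMs; };

// Times one launch of each kernel with the same blocks/threads.
// Both fields stay at -1 if the setup fails.
Timing CompareLaunches(int n, int threads)
{
    Timing t = { -1.0f, -1.0f };
    op_func_t f = nullptr;
    double* devOut = nullptr;
    if (cudaMemcpyFromSymbol(&f, GlobDevFunktion2Int, sizeof(op_func_t)) != cudaSuccess) return t;
    if (cudaMalloc(&devOut, n * sizeof(double)) != cudaSuccess) return t;

    int blocks = (n + threads - 1) / threads;
    double h = 1.0 / n;

    // Warm-up launches so that one-time startup cost is not measured.
    EvalDirect<<<blocks, threads>>>(0.0, h, n, devOut);
    EvalPointer<<<blocks, threads>>>(f, 0.0, h, n, devOut);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    EvalDirect<<<blocks, threads>>>(0.0, h, n, devOut);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&t.directMs, start, stop);

    cudaEventRecord(start);
    EvalPointer<<<blocks, threads>>>(f, 0.0, h, n, devOut);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&t.pointerMs, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(devOut);
    return t;
}
```

A single launch gives a noisy number; averaging over several launches per variant would make the comparison more reliable.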

Oh, I missed that I had changed the number of reduction threads while restructuring the code. I have corrected that, but the speed difference is still the same.