Speed loss with a pointer in a function call

Hi all !

My program shows some “strange” behavior and I would like an explanation.

I call a function from a kernel, and when I add a pointer to the function’s prototype I get a speed loss of 20%. The pointer is an array allocated with cudaMalloc.

I will show you the code to be clear.


float *other_tab, *tab;

cudaMalloc( &other_tab, SIZE*sizeof(*other_tab) );

cudaMalloc( &tab, SIZE*sizeof(*tab) );

First Code

__global__ void Kernel( float* other_tab ){

	function( other_tab );

}


Second Code (with speed-loss of 20%)

__global__ void Kernel( float* other_tab,  float* tab){

	function( other_tab, tab );

}


The body of the function stays the same; only the number of parameters changes. So tab is unused in the second example; it is only there for my test.

Do you have any solutions or answers?

  • Is it faster to pass a single structure containing all the arrays my function needs? (I don’t think so)

  • Should I use constant memory if my array isn’t modified in the kernel? If so, is that possible with dynamic allocation? (I don’t think so either)

I use CUDA 4.1 on a GeForce GTX 580 under CentOS 6.

I don’t want to speculate, given the minimal amount of information provided. But I would suggest you add the const qualifier to all pointers to read-only memory, and furthermore add __restrict__ if the pointers refer to non-overlapping, non-aliased memory objects.

Assuming both tables are read-only, you would wind up with

__global__ void Kernel(const float * __restrict__ other_tab, const float * __restrict__ tab);

Does that change the performance at all?
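To make the suggestion concrete, here is a minimal sketch of how the qualifiers would carry through from the kernel to the device function. The body of function, the extra out and n parameters, and the host helper run_kernel are placeholders I made up for illustration; they are not taken from your code.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real device function: it only reads
// from two read-only, non-aliased arrays.
__device__ float function(const float * __restrict__ other_tab,
                          const float * __restrict__ tab,
                          int i)
{
    return other_tab[i] + tab[i];
}

// Same qualifiers on the kernel parameters; out and n are assumptions
// added so the sketch produces an observable result.
__global__ void Kernel(const float * __restrict__ other_tab,
                       const float * __restrict__ tab,
                       float * __restrict__ out,
                       int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = function(other_tab, tab, i);
}

// Hypothetical host helper: fills the inputs with 1.0f and 2.0f,
// runs the kernel, and returns the last output element.
float run_kernel(int n)
{
    std::vector<float> a(n, 1.0f), b(n, 2.0f), out(n, 0.0f);
    size_t bytes = n * sizeof(float);

    float *d_a, *d_b, *d_out;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_a, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    Kernel<<<(n + threads - 1) / threads, threads>>>(d_a, d_b, d_out, n);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(out.data(), d_out, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_out);
    return out[n - 1];
}
```

Whether the qualifiers buy anything depends on what the compiler can do with the function body, so treat this as something to try rather than a guaranteed win.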

Unfortunately it changes nothing. Should I keep these keywords anyway?

In fact, I define all my arrays in a structure. Here is my code.

/** Host Code **/

typedef struct __align__(16) {

	float* tab;

	float* other_tab;

} Tab;

/** Device Code **/

__global__ void Kernel1( Tab myStructure){

	function( myStructure.other_tab );

}


__global__ void Kernel2( Tab myStructure){

	function( myStructure.other_tab, myStructure.tab );

}


// And the prototype of my function (two-parameter version) is:

__device__ void function( const float * __restrict__ other_tab, const float * __restrict__ tab );

Is my explanation clearer now?

Don’t hesitate to ask me for more information; I know I have trouble explaining my problem and giving all the useful information.

So I ran some more tests, and I think the speed loss comes from the size of the code. Indeed, if I comment out around 30 unused lines in my code and remove one parameter from my function, I gain some execution time.
By the way, is there a limit on the total size of a function’s parameters beyond which time is lost? I mean, is all the data needed for the call (parameters, function address) copied in a single transfer, depending on the width of the bus? (I don’t know if I am clear, I tried my best!)
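
For comparing variants like these, a timing pattern based on CUDA events may give more stable numbers than a single run. Below is a minimal sketch; the kernel body, the name time_kernel_ms, and the launch configuration are assumptions for illustration only, so swap in each real kernel variant in turn and compare the averages.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel standing in for the real one; only the timing
// pattern around it matters here.
__global__ void Kernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = 2.0f * in[i];
}

// Returns the average time in milliseconds of `reps` launches,
// measured with CUDA events on the default stream.
float time_kernel_ms(const float *d_in, float *d_out, int n, int reps)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;

    // Warm-up launch, not timed, so one-time setup cost is excluded.
    Kernel<<<blocks, threads>>>(d_in, d_out, n);

    cudaEventRecord(start, 0);
    for (int r = 0; r < reps; ++r)
        Kernel<<<blocks, threads>>>(d_in, d_out, n);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);   // wait until all timed launches are done

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / reps;
}
```

It may also be worth compiling both variants with nvcc -Xptxas -v and comparing the register and local memory usage that ptxas reports, since an extra parameter or extra code can change register allocation. As far as I know, on compute capability 2.x devices such as the GTX 580, kernel arguments are passed through constant memory and are limited to 4 KB in total, so two pointers are far from that limit.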