CUDA kernel runtime is longer than the CPU runtime, even without data loading time included

Hello,
I have implemented an image processing algorithm with CUDA, but the kernel runtime is actually 3 times the CPU runtime, even without taking the data loading time into account.
In the kernel I only use shared memory.
The algorithm is basically:

The input is a matrix of size 121x256 and the output is a matrix of size 150x150.

Could you please give me some hints on where to start optimizing my algorithm?

Thanks
Willer

Is this the same topic as http://forums.nvidia.com/index.php?showtopic=166364?

So you launch 150x150 threads? Or do you use global atomics?

Yes, it is actually the same topic.

But I don't use any atomics, just shared memory, and its size is 6.1 KB.

So I have as input a matrix of size 121x512, launched with <<< gridSize(1, 121), blockSize(512, 1, 1) >>>,

and the output matrix is 150x150.

And what are your GPU and CPU?

Please, what do you mean by GPU and CPU?

The runtime on the GPU is 0.685 s and on the CPU it is 0.30 s,

and as I increase the input size, the GPU runtime increases more than the CPU runtime.
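
Since the numbers above are meant to exclude data loading, it may help to time the kernel alone with CUDA events. The following is only a minimal sketch: `dummy_kernel` and `time_kernel_ms` are hypothetical names, and the placeholder kernel stands in for the real mapping kernel, which is not shown in full in this thread.

```cuda
#include <cuda_runtime.h>

// Placeholder kernel (hypothetical): stands in for the real mapping kernel.
__global__ void dummy_kernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = 2.0f * data[i];
}

// Returns the kernel-only time in milliseconds, excluding allocation and copies.
float time_kernel_ms(int n)
{
    float *d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;

    // Warm-up launch: the first launch also pays one-time setup costs.
    dummy_kernel<<<blocks, threads>>>(d_data, n);
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    dummy_kernel<<<blocks, threads>>>(d_data, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // launches are asynchronous, so wait for the stop event

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return ms;
}
```

Calling `time_kernel_ms(121 * 512)` would time one launch over an input of the size mentioned above. Timing with a host clock and no synchronization can give misleading numbers, because the launch returns before the kernel finishes.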

This is my kernel pseudocode:

[codebox]
__global__ void kernel()
{
    int yy = blockIdx.y;
    int xx = threadIdx.x;

    int thread_index = 0;
    int nbr_element_per_threads = number_of_element_per_thread;  // offset
    int thread_start_index = threadIdx.x * nbr_element_per_threads;
    int thread_end_index   = nbr_element_per_threads * (threadIdx.x + 1);

    encoder_map(0, &pos_x, &pos_y);                                        // device function
    offset_col = map_cell_offset_column(p_dev_map_mapping_config, pos_x);  // device function
    offset_row = map_cell_offset_row(p_dev_map_mapping_config, pos_y);

    // Pass 1: compute the target cell of each element and stage it in shared memory.
    for (thread_index = thread_start_index; thread_index < thread_end_index; thread_index++)
    {
        column = /* new column: mathematical function involving 2 cos */;
        row    = /* new row: mathematical function involving 2 sin */;

        shared_data[thread_index].value     = /* read global memory with yy and xx */;
        shared_data[thread_index].row_index = row + offset_row;
        shared_data[thread_index].col_index = column + offset_col;
    }

    // Pass 2: keep the largest absolute value seen for each output cell.
    for (thread_index = thread_start_index; thread_index < thread_end_index; thread_index++)
    {
        value  = shared_data[thread_index].value;
        row    = shared_data[thread_index].row_index;
        column = shared_data[thread_index].col_index;

        if ((row >= 0) && (row <= p_dev_map_mapping_config->row_max)
            && (column >= 0) && (column <= p_dev_map_mapping_config->col_max))
        {
            cellIndex = (p_dev_map_mapping_config->col_max) * row + column;

            if (value < 0)
                value = -value;

            if (output_dev_pixel_short[cellIndex] < value)
            {
                output_dev_pixel_short[cellIndex] = value;
            }
        }
    }
}
[/codebox]
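
One thing worth checking in the second loop of the pseudocode: the compare-then-store on the output is not atomic, so if two threads ever map to the same output cell, one of the updates may be lost. Below is a minimal sketch of one possible way around that, assuming the output buffer can be stored as `int` (the built-in `atomicMax` has no 16-bit overload). The names `scatter_max` and `scatter_max_host` are hypothetical, and the cell indices are taken as a precomputed input here rather than derived from the sin/cos mapping.

```cuda
#include <cuda_runtime.h>

// Each thread proposes |value| for one output cell; atomicMax keeps the largest.
__global__ void scatter_max(const int *values, const int *cell_index,
                            int *output, int n, int n_cells)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n)
        return;

    int cell = cell_index[i];
    if (cell < 0 || cell >= n_cells)
        return;  // sample falls outside the output matrix

    atomicMax(&output[cell], abs(values[i]));
}

// Host helper (hypothetical): copies the inputs, runs the kernel, copies the result back.
void scatter_max_host(const int *h_values, const int *h_cells, int n,
                      int *h_output, int n_cells)
{
    int *d_values = nullptr, *d_cells = nullptr, *d_output = nullptr;
    cudaMalloc(&d_values, n * sizeof(int));
    cudaMalloc(&d_cells, n * sizeof(int));
    cudaMalloc(&d_output, n_cells * sizeof(int));

    cudaMemcpy(d_values, h_values, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_cells, h_cells, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_output, 0, n_cells * sizeof(int));  // all cells start at 0

    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    scatter_max<<<blocks, threads>>>(d_values, d_cells, d_output, n, n_cells);

    // The blocking copy below also waits for the kernel to finish.
    cudaMemcpy(h_output, d_output, n_cells * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(d_values);
    cudaFree(d_cells);
    cudaFree(d_output);
}
```

In this sketch each thread keeps its value in a register instead of staging it in shared memory. In the pseudocode every thread only reads back the shared entries it wrote itself, so the staging step may not be buying anything.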

“Please, what do you mean by GPU and CPU?”

I mean the model names.

What is p_dev_map_mapping_config?
A global memory pointer?

Yes, it is actually a global memory pointer, and the value is also read from global memory using the block and thread indices.

Is this value the same for all threads in a block?
p_dev_map_mapping_config->row_max

Yes.

p_dev_map_mapping_config: this global memory pointer is the same for all threads.

You have a lot of room to improve speed. Just think about it: what is the point of loading from global memory a value that is common to all threads? You can pass it as a kernel parameter.
And instead of
if( (row >= 0) && (row <= p_dev_map_mapping_config->row_max) && (column >= 0) && (column <= p_dev_map_mapping_config->col_max))

use min and max when computing row and column, and so on.
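
To make the two suggestions above concrete, here is a rough sketch under stated assumptions. `MapConfig`, `clamp_to_cell` and `compute_cells` are hypothetical names, the config holds only the two fields mentioned in this thread, and the row stride is assumed to be `col_max + 1` (that is, `col_max` is the last valid column index), which may not match the original layout. The config is passed by value as a kernel parameter, so no thread dereferences a global memory pointer to read it. Note that clamping with `min`/`max` moves out-of-range samples onto the border cells instead of dropping them, so it is only equivalent to the original `if` when that behaviour is acceptable.

```cuda
#include <cuda_runtime.h>

// Hypothetical config struct: only the two limits named in the thread.
struct MapConfig {
    int row_max;  // assumed: last valid row index
    int col_max;  // assumed: last valid column index
};

// cfg is passed by value, so it is read as a kernel argument, not from global memory.
__global__ void clamp_to_cell(const int *rows, const int *cols,
                              int *cells, int n, MapConfig cfg)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n)
        return;

    // Clamp instead of branching on four comparisons.
    int row = min(max(rows[i], 0), cfg.row_max);
    int col = min(max(cols[i], 0), cfg.col_max);

    cells[i] = (cfg.col_max + 1) * row + col;  // assumed row stride
}

// Host helper (hypothetical): runs the kernel on host arrays of length n.
void compute_cells(const int *h_rows, const int *h_cols, int *h_cells,
                   int n, MapConfig cfg)
{
    size_t bytes = n * sizeof(int);
    int *d_rows = nullptr, *d_cols = nullptr, *d_cells = nullptr;
    cudaMalloc(&d_rows, bytes);
    cudaMalloc(&d_cols, bytes);
    cudaMalloc(&d_cells, bytes);

    cudaMemcpy(d_rows, h_rows, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_cols, h_cols, bytes, cudaMemcpyHostToDevice);

    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    clamp_to_cell<<<blocks, threads>>>(d_rows, d_cols, d_cells, n, cfg);

    // The blocking copy below also waits for the kernel to finish.
    cudaMemcpy(h_cells, d_cells, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_rows);
    cudaFree(d_cols);
    cudaFree(d_cells);
}
```

For a 150x150 output, the config would be `MapConfig cfg = {149, 149};` under the assumption above.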