CUDA runtime larger than CPU runtime, even without counting data loading time

texwiller204 · April 28, 2010, 12:23pm

Hello,
I have implemented an image processing algorithm with CUDA, but the kernel runtime is actually 3 times the CPU runtime, even without taking the data loading time into account.
In the kernel I only used shared memory.
The algorithm basically takes

an input matrix of size 121x256 and outputs a matrix of size 150x150.

Please can you give me some hints on where to start optimizing my algorithm?

Thanks
Willer

Lev · April 28, 2010, 12:39pm

Is this from the same topic, http://forums.nvidia.com/index.php?showtopic=166364 ?

So do you launch 150x150 threads? Or do you use global atomics?

texwiller204 · April 28, 2010, 1:48pm

Yes, it is actually the same.

But I don't use any atomics, just shared memory, and its size is 6.1 KB.

So I have as input a matrix of size 121x512, launched with <<<gridSize(1,121), blockSize(512,1,1)>>>,

and the output matrix is 150x150.

Lev · April 28, 2010, 2:05pm

And what is your gpu and cpu?

texwiller204 · April 28, 2010, 2:14pm

Please what do you mean by gpu and cpu ?

The runtime on the GPU is 0.685 s and on the CPU it is 0.30 s.

As I increase the size, the GPU runtime increases more than the CPU runtime.

This is my kernel pseudocode:

[codebox]
__global__ void kernel()
{
    int yy = blockIdx.y;
    int xx = threadIdx.x;
    int thread_index = 0;
    int nbr_element_per_threads = number_of_element_per_thread; // offset
    int thread_start_index = threadIdx.x * nbr_element_per_threads;
    int thread_end_index   = nbr_element_per_threads * (threadIdx.x + 1);

    encoder_map(0, &pos_x, &pos_y);                                        // device function
    offset_col = map_cell_offset_column(p_dev_map_mapping_config, pos_x);  // device function
    offset_row = map_cell_offset_row(p_dev_map_mapping_config, pos_y);

    // Pass 1: compute the target cell of each element and stage it in shared memory.
    for (thread_index = thread_start_index; thread_index < thread_end_index; thread_index++)
    {
        column = /* new column (mathematical function involving 2 cos) */;
        row    = /* new row (mathematical function involving 2 sin) */;
        shared_data[thread_index].value     = /* read global memory with yy and xx */;
        shared_data[thread_index].row_index = row + offset_row;
        shared_data[thread_index].col_index = column + offset_col;
    }

    // Pass 2: store the absolute value in the output cell if it exceeds the current one.
    for (thread_index = thread_start_index; thread_index < thread_end_index; thread_index++)
    {
        value  = shared_data[thread_index].value;
        row    = shared_data[thread_index].row_index;
        column = shared_data[thread_index].col_index;

        if ((row >= 0) && (row <= p_dev_map_mapping_config->row_max)
            && (column >= 0) && (column <= p_dev_map_mapping_config->col_max))
        {
            cellIndex = (p_dev_map_mapping_config->col_max) * row + column;

            if (value < 0)
                value = -value;

            if (output_dev_pixel_short[cellIndex] < value)
            {
                output_dev_pixel_short[cellIndex] = value;
            }
        }
    }
}
[/codebox]

Lev · April 28, 2010, 2:17pm

“Please what do you mean by gpu and cpu ?”

The model names.

What is p_dev_map_mapping_config?
A global memory pointer?

texwiller204 · April 28, 2010, 2:24pm

Yes, it is actually a global memory pointer, and the value is also read from global memory using the block and thread indices.

Lev · April 28, 2010, 2:31pm

Is this value the same for all threads in the block?
p_dev_map_mapping_config->row_max

texwiller204 · April 28, 2010, 2:38pm

yes

texwiller204 · April 28, 2010, 2:42pm

p_dev_map_mapping_config: this global memory pointer is the same for all threads.

Lev · April 28, 2010, 3:14pm

You have a lot of room to improve speed. Just think a bit: what is the point of loading a value from global memory when it is common to all threads? You may pass it as a kernel parameter.
And instead of
if( (row >= 0) && (row <= p_dev_map_mapping_config->row_max) && (column >= 0) && (column <= p_dev_map_mapping_config->col_max))

use min and max when computing row and column, and so on.
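
A minimal sketch of these two suggestions, under stated assumptions, and not the original kernel from this thread. The bounds are passed by value as kernel arguments, and min/max clamp the indices instead of the four-way comparison. The kernel name scatter_max, the arrays rows, cols, values and out, and the toy sizes are all hypothetical. Two points deserve care: clamping is not equivalent to the original test, because out-of-range points land on the border cell instead of being skipped, so it only applies if that is acceptable; and atomicMax is swapped in for the plain compare-and-store so that the result stays well defined when several threads hit the same cell. The row stride is taken as col_max + 1, which is also an assumption.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical kernel: one thread per input element.
// row_max and col_max arrive by value as kernel arguments, so no thread
// has to dereference a config struct in global memory to read them.
__global__ void scatter_max(const int *rows, const int *cols, const int *values,
                            int n, int row_max, int col_max, int *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Clamp with min/max instead of branching on four comparisons.
    // Assumption: folding out-of-range points onto the border is acceptable.
    int row = min(max(rows[i], 0), row_max);
    int col = min(max(cols[i], 0), col_max);

    // Assumption: the row stride is col_max + 1, since col == col_max is valid.
    int cell = row * (col_max + 1) + col;

    // atomicMax (swapped in for the plain compare-and-store) avoids a race
    // when several threads target the same output cell.
    atomicMax(&out[cell], abs(values[i]));
}

int main()
{
    const int n = 4, row_max = 2, col_max = 2;
    const int cells = (row_max + 1) * (col_max + 1);
    int h_rows[n] = {0, 1, 5, -3};   // the last two rows are out of range
    int h_cols[n] = {0, 1, 1, 2};
    int h_vals[n] = {-7, 4, 9, -2};
    int h_out[cells] = {0};

    int *d_rows, *d_cols, *d_vals, *d_out;
    cudaMalloc(&d_rows, n * sizeof(int));
    cudaMalloc(&d_cols, n * sizeof(int));
    cudaMalloc(&d_vals, n * sizeof(int));
    cudaMalloc(&d_out, cells * sizeof(int));
    cudaMemcpy(d_rows, h_rows, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_cols, h_cols, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_vals, h_vals, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, cells * sizeof(int));

    scatter_max<<<1, 32>>>(d_rows, d_cols, d_vals, n, row_max, col_max, d_out);

    // The blocking copy also surfaces any error from the kernel launch.
    cudaError_t err = cudaMemcpy(h_out, d_out, cells * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return EXIT_FAILURE;
    }

    // Print the output grid row by row.
    for (int r = 0; r <= row_max; ++r) {
        for (int c = 0; c <= col_max; ++c)
            printf("%d ", h_out[r * (col_max + 1) + c]);
        printf("\n");
    }

    cudaFree(d_rows); cudaFree(d_cols); cudaFree(d_vals); cudaFree(d_out);
    return EXIT_SUCCESS;
}
```

If out-of-range points must be discarded rather than clamped, the bounds can still be passed as kernel arguments and the comparison kept; only the repeated global memory reads of row_max and col_max go away.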