CUDA runtime bigger than CPU, even with loading time excluded

Dear ,
I have implemented an image processing algorithm with CUDA, but the runtime of the kernel is actually 3 times the runtime of the CPU version, with the data loading time not taken into account.

In the kernel I only use shared memory.

The algorithm basically takes

an input matrix of size 121x256 and outputs a matrix of size 150x150.

Can you please give me some hints on where to start optimizing my algorithm?


Is that from this [url=""][/url] thread?

So you launch 150x150 threads? Or do you use global atomics?

Yes, it is actually the same.

But I don't use any atomics, just shared memory, and its size is 6.1 KB.

So I have as input a matrix of size 121x512, launched with <<<gridSize(1,121), blockSize(512,1,1)>>>,

and the output matrix is 150x150.
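As a quick back-of-the-envelope check on the 6.1 KB of shared memory per block mentioned above, one can estimate how many blocks fit on one SM. The 48 KB per-SM figure below is an assumption (the real limit depends on the GPU model, which is asked about later in this thread); 6247 bytes approximates the stated 6.1 KB per block.

```cpp
// Hypothetical occupancy arithmetic: how many blocks can be resident on
// one SM when each block reserves a fixed amount of shared memory.
// smem_per_sm_bytes is an assumed per-SM limit (e.g. 48 KB), not a value
// taken from this thread.
int resident_blocks(int smem_per_sm_bytes, int smem_per_block_bytes) {
    return smem_per_sm_bytes / smem_per_block_bytes;  // integer division
}
```

With these numbers, 48 * 1024 / 6247 gives 7 resident blocks per SM, so shared-memory capacity alone is probably not the bottleneck here, but it is worth re-checking against the actual GPU's limits.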

And what are your GPU and CPU?

Please what do you mean by gpu and cpu ?

The runtime on the GPU is 0.685 s and on the CPU it is 0.30 s,

and as I increase the size, the GPU runtime grows faster than the CPU runtime.

This is my kernel pseudocode:


__global__ void kernel() {

    int yy = blockIdx.y;
    int xx = threadIdx.x;

    int nbr_elements_per_thread = number_of_elements_per_thread;  // offset

    int thread_start_index = threadIdx.x * nbr_elements_per_thread;
    int thread_end_index   = nbr_elements_per_thread * (threadIdx.x + 1);

    encoder_map(0, &pos_x, &pos_y);                                        // device function
    offset_col = map_cell_offset_column(p_dev_map_mapping_config, pos_x);  // device function
    offset_row = map_cell_offset_row   (p_dev_map_mapping_config, pos_y);

    // First pass: compute coordinates and stage everything in shared memory.
    for (int thread_index = thread_start_index; thread_index < thread_end_index; thread_index++) {

        column = new column computed (mathematical function involving 2 cos);
        row    = new row computed    (mathematical function involving 2 sin);

        shared_data[thread_index].value     = read global memory with yy and xx;
        shared_data[thread_index].row_index = row + offset_row;
        shared_data[thread_index].col_index = column + offset_col;
    }

    // Second pass: write the staged values into the output image.
    for (int thread_index = thread_start_index; thread_index < thread_end_index; thread_index++) {

        value  = shared_data[thread_index].value;
        row    = shared_data[thread_index].row_index;
        column = shared_data[thread_index].col_index;

        if ((row >= 0) && (row <= p_dev_map_mapping_config->row_max)
            && (column >= 0) && (column <= p_dev_map_mapping_config->col_max)) {

            cellIndex = (p_dev_map_mapping_config->col_max) * row + column;

            if (value < 0)
                value = -value;

            if (output_dev_pixel_short[cellIndex] < value)
                output_dev_pixel_short[cellIndex] = value;
        }
    }
}



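One thing the pseudocode's second loop glosses over: the unguarded `if (output < value) output = value` is a read-modify-write, so two threads that map to the same cellIndex can race, which is presumably why atomics were asked about earlier. Below is a serial, host-side sketch of that scatter-max pass; `Sample` and `scatter_abs_max` are hypothetical names, not from the original code, and the race-free device variant is noted in the comments.

```cpp
#include <cstdlib>
#include <vector>

struct Sample { int row, col, value; };  // hypothetical stand-in for shared_data entries

// Serial reference of the kernel's second loop: take |value| and keep the
// per-cell maximum. On the device, the plain "if (out < v) out = v" races
// when two threads hit the same cell; atomicMax(&out[idx], v) is the
// race-free equivalent (assuming the output buffer is widened from short
// to int, since CUDA's atomicMax does not support 16-bit types).
void scatter_abs_max(std::vector<int>& out, int row_max, int col_max,
                     const std::vector<Sample>& samples) {
    for (const Sample& s : samples) {
        if (s.row < 0 || s.row > row_max || s.col < 0 || s.col > col_max)
            continue;                        // same bounds check as the kernel
        int idx = col_max * s.row + s.col;   // the kernel's cellIndex formula
        int v = std::abs(s.value);
        if (out[idx] < v) out[idx] = v;      // atomicMax(&out[idx], v) on device
    }
}
```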


“Please what do you mean by gpu and cpu ?”

The model names.

What is p_dev_map_mapping_config? A global memory pointer?

Yes, it is actually a global memory pointer, and the value is read from global memory with the index of the block and thread.

Is this value the same among all threads in a block?


p_dev_map_mapping_config: this global memory pointer is the same for all threads.

You have a lot of room to improve speed. Just think a bit: what is the point of loading from global memory a value which is common to all threads? You can pass it as a kernel parameter.
And instead of

if( (row >= 0) && (row <= p_dev_map_mapping_config->row_max) && (column >= 0) && (column <= p_dev_map_mapping_config->col_max))

use min and max when computing row and column, and so on.
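A minimal sketch of both suggestions, assuming a hypothetical `MapConfig` struct holding `row_max`/`col_max` (names invented here, not from the original code): pass the small config struct to the kernel by value instead of dereferencing a global-memory pointer in every thread, and clamp `row`/`column` once where they are computed. Note that clamping maps out-of-range samples onto the border cells instead of skipping them, so it only matches the original behaviour if that is acceptable.

```cpp
#include <algorithm>

// Hypothetical config struct; small enough to pass to the kernel by value,
// e.g.  __global__ void kernel(MapConfig cfg, ...)
// so no thread has to fetch it from global memory.
struct MapConfig { int row_max, col_max; };

// Clamp an index into [0, v_max] with min/max, replacing the four-way
// bounds test in the inner loop (a sketch, not the original code).
inline int clamp_index(int v, int v_max) {
    return std::min(std::max(v, 0), v_max);
}
```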