Coalesced memory access pattern in Image processing

Hello everyone, I'm trying to write a CUDA kernel that works on global memory, but after profiling the code I found that my memory access pattern is uncoalesced (LOW load and store efficiency).
The code is as follows:

// Kernel signature reconstructed for completeness; the pixel type of
// g_In / Spatial_Filtered and the integer type of Width / Height are assumed.
__global__ void Spatial_Filter (const unsigned char *g_In, unsigned char *Spatial_Filtered,
                                unsigned int Width, unsigned int Height)
{
	// calculate global memory location index
	const unsigned int g_Col_Index = __umul24 (blockIdx.x, blockDim.x) + threadIdx.x;
	const unsigned int g_Row_Index = __umul24 (blockIdx.y, blockDim.y) + threadIdx.y;

	// Apply spatial filter

	// Boundary conditions
	if (g_Col_Index >= Width || g_Row_Index >= Height)
		// say good bye
		return;

	// for intermediate calculation
	int temp_val = 0;

	// image boundary condition: process interior pixels only
	// (3x3 filter with weights 1-2-1 / 2-4-2 / 1-2-1)
	if ((g_Row_Index != 0) && (g_Col_Index != 0) && (g_Row_Index != Height - 1) && (g_Col_Index != Width - 1))
	{
		temp_val =  (g_In [__umul24 (g_Row_Index - 1, Width) + g_Col_Index - 1])
		         +  (g_In [__umul24 (g_Row_Index - 1, Width) + g_Col_Index    ] << 1)
		         +  (g_In [__umul24 (g_Row_Index - 1, Width) + g_Col_Index + 1])
		         +  (g_In [__umul24 (g_Row_Index,     Width) + g_Col_Index - 1] << 1)
		         +  (g_In [__umul24 (g_Row_Index,     Width) + g_Col_Index    ] << 2)
		         +  (g_In [__umul24 (g_Row_Index,     Width) + g_Col_Index + 1] << 1)
		         +  (g_In [__umul24 (g_Row_Index + 1, Width) + g_Col_Index - 1])
		         +  (g_In [__umul24 (g_Row_Index + 1, Width) + g_Col_Index    ] << 1)
		         +  (g_In [__umul24 (g_Row_Index + 1, Width) + g_Col_Index + 1]);

		// normalize by the weight sum (16) with rounding
		Spatial_Filtered [__umul24 (g_Row_Index, Width) + g_Col_Index] = (temp_val + 8) >> 4;
	}
}

The grid configuration is as follows:
dim3 dimBlock (32, 8, 1);
dim3 dimGrid (gridSize_x, gridSize_y, 1);
where gridSize_x and gridSize_y are obtained from the block size,
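i.e. something like the usual round-up division:

// illustrative sketch only; the actual computation may differ
const int gridSize_x = (Width  + dimBlock.x - 1) / dimBlock.x;   // ceil(Width / 32)
const int gridSize_y = (Height + dimBlock.y - 1) / dimBlock.y;   // ceil(Height / 8)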

How should I write it so that the memory access pattern becomes coalesced?

Have a look at nVidia's SDK samples.
Basically you should use textures and/or shared memory instead.
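
For the shared-memory route, a rough (untested) sketch of the same 3x3 filter could look like this: each block stages its tile plus a one-pixel halo in shared memory, so the global loads stay contiguous, and the nine taps are then read from shared memory. The pixel type (unsigned char) and the 32x8 block size are assumed from your post.

// Rough, untested sketch only. Pixel type assumed to be unsigned char;
// block size assumed to stay at 32x8.
#define BLOCK_X 32
#define BLOCK_Y 8

__global__ void Spatial_Filter_Shared (const unsigned char *g_In, unsigned char *Spatial_Filtered,
                                       int Width, int Height)
{
	// tile with a one-pixel halo on every side
	__shared__ int tile[BLOCK_Y + 2][BLOCK_X + 2];

	const int col = blockIdx.x * BLOCK_X + threadIdx.x;
	const int row = blockIdx.y * BLOCK_Y + threadIdx.y;

	// cooperatively load the (BLOCK_X+2) x (BLOCK_Y+2) tile; each row of loads
	// is contiguous in global memory, so the loads are (mostly) coalesced
	for (int dy = threadIdx.y; dy < BLOCK_Y + 2; dy += BLOCK_Y)
		for (int dx = threadIdx.x; dx < BLOCK_X + 2; dx += BLOCK_X)
		{
			int y = (int)(blockIdx.y * BLOCK_Y) + dy - 1;
			int x = (int)(blockIdx.x * BLOCK_X) + dx - 1;
			y = min (max (y, 0), Height - 1);   // clamp halo loads to the image
			x = min (max (x, 0), Width - 1);
			tile[dy][dx] = g_In[y * Width + x];
		}
	__syncthreads();

	// same policy as the original kernel: write interior pixels only
	if (row <= 0 || col <= 0 || row >= Height - 1 || col >= Width - 1)
		return;

	const int ty = threadIdx.y + 1;
	const int tx = threadIdx.x + 1;

	// 3x3 weights 1-2-1 / 2-4-2 / 1-2-1, all nine reads now come from shared memory
	int sum =  tile[ty - 1][tx - 1]        + (tile[ty - 1][tx] << 1) +  tile[ty - 1][tx + 1]
	        + (tile[ty    ][tx - 1] << 1)  + (tile[ty    ][tx] << 2) + (tile[ty    ][tx + 1] << 1)
	        +  tile[ty + 1][tx - 1]        + (tile[ty + 1][tx] << 1) +  tile[ty + 1][tx + 1];

	Spatial_Filtered[row * Width + col] = (sum + 8) >> 4;
}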

eyal

I just found that out a couple of minutes ago… BTW thanks eyal