Hello everyone, I’m trying to write CUDA code that uses global memory, but after profiling it I found that my memory access pattern is uncoalesced (low load and store efficiency).

The code is as follows:

```
// Calculate the global memory location index
const unsigned int g_Col_Index = __umul24(blockIdx.x, blockDim.x) + threadIdx.x;
const unsigned int g_Row_Index = __umul24(blockIdx.y, blockDim.y) + threadIdx.y;

// Apply spatial filter
// Bounds check: threads that fall outside the image exit early
if (g_Col_Index >= Width || g_Row_Index >= Height)
    return;

// For intermediate calculation
int temp_val = 0;

// Image boundary condition: skip the one-pixel border
if ((g_Row_Index != 0) && (g_Col_Index != 0) && (g_Row_Index != Height - 1) && (g_Col_Index != Width - 1))
{
    // 3x3 weighted sum with weights 1 2 1 / 2 4 2 / 1 2 1
    temp_val = (g_In[__umul24(g_Row_Index - 1, Width) + g_Col_Index - 1])
             + (g_In[__umul24(g_Row_Index - 1, Width) + g_Col_Index    ] << 1)
             + (g_In[__umul24(g_Row_Index - 1, Width) + g_Col_Index + 1])
             + (g_In[__umul24(g_Row_Index,     Width) + g_Col_Index - 1] << 1)
             + (g_In[__umul24(g_Row_Index,     Width) + g_Col_Index    ] << 2)
             + (g_In[__umul24(g_Row_Index,     Width) + g_Col_Index + 1] << 1)
             + (g_In[__umul24(g_Row_Index + 1, Width) + g_Col_Index - 1])
             + (g_In[__umul24(g_Row_Index + 1, Width) + g_Col_Index    ] << 1)
             + (g_In[__umul24(g_Row_Index + 1, Width) + g_Col_Index + 1]);

    // Normalize by 16 with rounding
    Spatial_Filtered[__umul24(g_Row_Index, Width) + g_Col_Index] = (temp_val + 8) >> 4;
}
```

The grid configuration is as follows:

dim3 dimBlock (32,8,1);

dim3 dimGrid (gridSize_x , gridSize_y,1) ;

where gridSize_x and gridSize_y are computed from the block size.
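The exact grid-size computation isn't shown above. As a minimal sketch, assuming the usual ceiling division so that partially filled blocks at the right and bottom edges are still launched, it would look roughly like this. The helper name `div_up` is hypothetical and used only for illustration; it is plain host-side C++ that also compiles in a `.cu` file.

```cpp
// Hypothetical helper (not part of the original code): ceiling division.
// Rounds up so that an image whose size is not a multiple of the block
// size still gets enough blocks to cover every pixel.
constexpr unsigned int div_up(unsigned int n, unsigned int block)
{
    return (n + block - 1) / block;
}

// Assumed usage with dimBlock(32, 8, 1) and a Width x Height image:
//   const unsigned int gridSize_x = div_up(Width,  32);
//   const unsigned int gridSize_y = div_up(Height, 8);
//   dim3 dimGrid(gridSize_x, gridSize_y, 1);
```

This rounding is why the kernel needs the early `return` for threads whose indices fall outside `Width` or `Height`.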

How can I write this so that the memory access pattern becomes coalesced?