Hi, I'm doing something very similar to a convolution; for the purposes of this discussion it can be viewed as a convolution.

```
__global__ void
parallelFwdPrjKernel(float* g_dataVol, int zIdx, int dimVolX, int prjDimX,
                     int prjDimY, int kernelWidth, float* g_projectionPrev,
                     float* g_projectionNext, int projIdx, int volStride,
                     int projStride)
{
    __shared__ float Vol[BLOCKSIZEX][BLOCKSIZEY];
    __shared__ float ConvVol[BLOCKSIZEX][BLOCKSIZEY];
    __shared__ float Proj[BLOCKSIZEX][BLOCKSIZEY];

    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int i, j, k;

    // volStride  = dimVolX * prjDimY  (index into Z plane)
    // projStride = prjDimX * prjDimY
    j = (projStride * projIdx) + (prjDimX * y) + x;
    Proj[threadIdx.x][threadIdx.y] = g_projectionPrev[j];

    for (k = 0; k < kernelWidth; k++) {
        // One global read per thread per kernel tap.
        i = (volStride * zIdx) + (dimVolX * y) + x + offsetsConstant[k];
        Vol[threadIdx.x][threadIdx.y] = g_dataVol[i];
        __syncthreads();
        ConvVol[threadIdx.x][threadIdx.y] = Vol[threadIdx.x][threadIdx.y] * kernelConstant[k];
        Proj[threadIdx.x][threadIdx.y] += ConvVol[threadIdx.x][threadIdx.y];
    }
    g_projectionNext[j] = Proj[threadIdx.x][threadIdx.y];
}
```

I know this is not the most optimized implementation, as I am making extra global memory reads that could be avoided by using padded shared memory, similar to your own example. I did in fact create the padded shared-memory version, which performs only one global read per central value plus a few extra reads at the boundaries, at the cost of some thread-ID checks. I noticed only something like a 10% performance increase, a lot less than I expected, considering the cost of a global fetch and the fact that the occupancy is 0.33. I should also mention that this kernel gets called 2500 times; could that be my bottleneck?
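For comparison, here is a minimal sketch of the padded shared-memory (halo) pattern I have in mind, in case it differs from your version. It is only a sketch under several assumptions: a symmetric 1D kernel along x with `kernelWidth == 2*HALO + 1`, a block shape matching `TILE_X` x `TILE_Y`, and clamped borders; the names `paddedFwdPrjKernel`, `TILE_X`, `TILE_Y`, and `HALO` are made up for illustration. If your `offsetsConstant` encodes more general (e.g. 2D) offsets, this won't apply directly:

```cuda
#define TILE_X 16
#define TILE_Y 16
#define HALO   4   // assumes kernelWidth == 2*HALO + 1

__constant__ float kernelConstant[2 * HALO + 1];

__global__ void
paddedFwdPrjKernel(const float* g_dataVol, const float* g_projectionPrev,
                   float* g_projectionNext, int dimVolX, int prjDimX,
                   int kernelWidth, int volStride, int projStride,
                   int zIdx, int projIdx)
{
    // Padded tile: TILE_X central values plus HALO on each side.
    __shared__ float tile[TILE_Y][TILE_X + 2 * HALO];

    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int base = volStride * zIdx + dimVolX * y;

    // Central load: exactly one global read per thread.
    tile[threadIdx.y][threadIdx.x + HALO] = g_dataVol[base + x];

    // Halo loads: only the first HALO threads in x do the extra reads,
    // with indices clamped at the volume borders.
    if (threadIdx.x < HALO) {
        int xl = max(x - HALO, 0);
        int xr = min(x + TILE_X, dimVolX - 1);
        tile[threadIdx.y][threadIdx.x]                 = g_dataVol[base + xl];
        tile[threadIdx.y][threadIdx.x + TILE_X + HALO] = g_dataVol[base + xr];
    }
    __syncthreads();

    // Accumulate entirely from shared memory; tap k reads offset k - HALO.
    int j = projStride * projIdx + prjDimX * y + x;
    float acc = g_projectionPrev[j];
    for (int k = 0; k < kernelWidth; k++)
        acc += tile[threadIdx.y][threadIdx.x + k] * kernelConstant[k];
    g_projectionNext[j] = acc;
}
```

On the 2500 calls: launch overhead is typically on the order of microseconds per kernel, so 2500 launches by themselves would only add up to milliseconds; it would mainly matter if each launch does very little work, in which case batching several z-slices per launch might help more than further tuning the memory access pattern.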

Any tips or advice are appreciated.

Regards,
Danny