All the following code does is move a window of size (2N+1) x (2N+1) over every pixel in a left and a right image and calculate the mean pixel value in that window. The block size is (32,8,1) and the grid size is (32,96,1).

```
__constant__ int N;
texture <float4, 2, cudaReadModeNormalizedFloat> tex_recImageL;
texture <float4, 2, cudaReadModeNormalizedFloat> tex_recImageR;
__global__ void
cuStereoCorr(int* RES_H, int* RES_V, float* ResultDisp, float* MeanL, float* MeanR, float* STDL, float* STDR)
{
    int x = __mul24(blockIdx.x, blockDim.x) + threadIdx.x;
    int y = __mul24(blockIdx.y, blockDim.y) + threadIdx.y;
    int loc = __mul24(y, 1024) + x;  // row pitch of 1024 matches 32 blocks x 32 threads in x

    // Accumulators must start at zero before summing over the window.
    float4 MeanL_Temp = make_float4(0.f, 0.f, 0.f, 0.f);
    float4 MeanR_Temp = make_float4(0.f, 0.f, 0.f, 0.f);

    #pragma unroll
    for (int u = -N; u <= N; u++)
    {
        #pragma unroll
        for (int v = -N; v <= N; v++)
        {
            // Sum each channel over the window; normalize once after the loops.
            float4 ValueL = tex2D(tex_recImageL, x + u, y + v);
            MeanL_Temp.x += ValueL.x;
            MeanL_Temp.y += ValueL.y;
            MeanL_Temp.z += ValueL.z;
            float4 ValueR = tex2D(tex_recImageR, x + u, y + v);
            MeanR_Temp.x += ValueR.x;
            MeanR_Temp.y += ValueR.y;
            MeanR_Temp.z += ValueR.z;
        }
    }

    // Divide by the number of pixels in the window, (2N+1)^2, then weight the channels.
    float norm = 1.0f / (float)((2 * N + 1) * (2 * N + 1));
    MeanL[loc] = norm * (.11f * MeanL_Temp.x + .59f * MeanL_Temp.y + .3f * MeanL_Temp.z);
    MeanR[loc] = norm * (.11f * MeanR_Temp.x + .59f * MeanR_Temp.y + .3f * MeanR_Temp.z);
}
```
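For checking the kernel's output, the windowed mean described at the top can be written as a plain CPU loop. This is only a minimal sketch under a few assumptions that are not in the kernel above: the function name `windowMeanRef` is made up for illustration, the image is a single-channel row-major `float` buffer (the channel weighting is linear, so it could be applied before or after averaging), and out-of-range coordinates are clamped to the border, which may or may not match the texture addressing mode actually in use.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference (not part of the kernel above): mean over a
// (2n+1) x (2n+1) window centred on each pixel of a single-channel image.
// Assumption: coordinates outside the image are clamped to the nearest edge.
std::vector<float> windowMeanRef(const std::vector<float>& img,
                                 int width, int height, int n)
{
    assert(img.size() == static_cast<std::size_t>(width) * height);
    std::vector<float> out(img.size(), 0.0f);
    // Number of pixels in the window is (2n+1)^2, not (2n+1).
    const float norm = 1.0f / static_cast<float>((2 * n + 1) * (2 * n + 1));

    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            float sum = 0.0f;
            for (int v = -n; v <= n; ++v) {
                for (int u = -n; u <= n; ++u) {
                    // Clamp the sample position to the image bounds.
                    const int xx = std::min(std::max(x + u, 0), width - 1);
                    const int yy = std::min(std::max(y + v, 0), height - 1);
                    sum += img[yy * width + xx];
                }
            }
            out[y * width + x] = sum * norm;  // normalize once per pixel
        }
    }
    return out;
}
```

Comparing `MeanL` and `MeanR` against something like this on a small test image should show whether the kernel's accumulation and normalization behave as intended, at least away from the image borders where the clamping assumption does not matter.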

If I comment out the last two lines (the stores to MeanL and MeanR) I get 100% occupancy, but leaving them in gives 33%. I don’t understand why, since I have 0 incoherent loads or stores. If you have any ideas, or thoughts on more efficient ways to do this, please let me know.

Thanks!