I’m trying to apply a simple blur algorithm that leaves the outermost rows and columns unchanged, but when I change the block size, the output values come out wrong.
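To pin down what "wrong" means, here is a minimal CPU sketch of the blur I have in mind, which could serve as a reference when comparing against the GPU output. The function name `blurReference` and the `std::vector` interface are assumptions for illustration only; the real program works on raw device pointers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference for the 3x3 box blur described above.
// The matrix is stored row-major with width w and height h.
// Border elements are copied unchanged; every interior element becomes
// the mean of its 3x3 neighbourhood.
std::vector<float> blurReference(const std::vector<float>& in, int w, int h)
{
    assert(w > 0 && h > 0);
    assert(in.size() == static_cast<std::size_t>(w) * static_cast<std::size_t>(h));

    std::vector<float> out(in.size());
    for (int row = 0; row < h; ++row) {
        for (int col = 0; col < w; ++col) {
            const int idx = row * w + col;
            if (row == 0 || col == 0 || row == h - 1 || col == w - 1) {
                out[idx] = in[idx];  // border: copy unchanged
                continue;
            }
            float sum = 0.0f;
            for (int dr = -1; dr <= 1; ++dr)
                for (int dc = -1; dc <= 1; ++dc)
                    sum += in[(row + dr) * w + (col + dc)];
            out[idx] = sum / 9.0f;  // mean of the 9 neighbours
        }
    }
    return out;
}
```

Comparing this element by element with the array copied back from the device (with a small tolerance, since the summation order differs from the kernel) should show which positions are affected.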

Host code excerpt

```
...
// Thread block size
#define BLOCK_SIZE 12
// Matrix dimensions
// (chosen as multiples of the thread block size for simplicity)
#define WA (3 * BLOCK_SIZE) // Matrix A width
#define HA (5 * BLOCK_SIZE) // Matrix A height
...
dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
dim3 dimGrid(HA / dimBlock.x, WA / dimBlock.y);
testKernel<<< dimGrid, dimBlock >>>(d_oA, d_iA, WA, HA);
...
```

Kernel

```
__global__ void
testKernel(float* Ablur, float* A, int wA, int hA)
{
    void checkCUDAError(const char* msg);
    // Block index
    int bx = blockIdx.x;
    int by = blockIdx.y;
    // Thread index
    int tx = threadIdx.x;
    int ty = threadIdx.y;
    // Global row and column handled by this thread
    int row = blockDim.y * by + ty;
    int col = blockDim.x * bx + tx;
    if (row == 0 || col == 0 || (row == (hA - 1)) || (col == (wA - 1)))
        // Border element: copy unchanged
        Ablur[row * wA + col] = A[row * wA + col];
    else
        // Interior element: mean of the 3x3 neighbourhood
        Ablur[row * wA + col] =
            (A[row * wA + col] + A[row * wA + col + 1] + A[row * wA + col - 1] +
             A[(row + 1) * wA + col - 1] + A[(row + 1) * wA + col + 1] + A[(row + 1) * wA + col] +
             A[(row - 1) * wA + col] + A[(row - 1) * wA + col + 1] + A[(row - 1) * wA + col - 1]) / 9;
}
```
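One way to check the execution parameters without a GPU is to redo the kernel's index arithmetic on the host. The sketch below assumes the same mapping as the kernel above (`col` from the x dimension, `row` from the y dimension) and counts how many threads fall outside the matrix and how many matrix elements no thread reaches. The names `Coverage` and `emulateLaunch` are hypothetical helpers, not part of the original program.

```cpp
#include <cassert>

// Result of checking one launch configuration on the host.
struct Coverage {
    long outOfRangeThreads;  // threads whose (row, col) lies outside the matrix
    long uncoveredElements;  // matrix elements that no thread maps to
};

// Mirrors the kernel's index arithmetic for a gridX x gridY grid of
// block x block threads over a matrix of width wA and height hA:
//   col = blockDim.x * blockIdx.x + threadIdx.x  -> 0 .. gridX * block - 1
//   row = blockDim.y * blockIdx.y + threadIdx.y  -> 0 .. gridY * block - 1
Coverage emulateLaunch(int gridX, int gridY, int block, int wA, int hA)
{
    assert(gridX > 0 && gridY > 0 && block > 0 && wA > 0 && hA > 0);

    const long cols = static_cast<long>(gridX) * block;  // columns spanned by the grid
    const long rows = static_cast<long>(gridY) * block;  // rows spanned by the grid

    // Portion of the grid that overlaps the matrix.
    const long inCols = cols < wA ? cols : wA;
    const long inRows = rows < hA ? rows : hA;

    Coverage c;
    c.outOfRangeThreads = cols * rows - inCols * inRows;
    c.uncoveredElements = static_cast<long>(wA) * hA - inCols * inRows;
    return c;
}
```

Calling `emulateLaunch(HA / BLOCK_SIZE, WA / BLOCK_SIZE, BLOCK_SIZE, WA, HA)` corresponds to the launch in the host excerpt, and `emulateLaunch(WA / BLOCK_SIZE, HA / BLOCK_SIZE, BLOCK_SIZE, WA, HA)` to the same launch with the two grid dimensions swapped. A non-zero count in either field would suggest that the grid shape does not match the matrix shape for that configuration.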

It works with `#define BLOCK_SIZE 4` and a very large matrix, so the problem seems to be related to the block size. Is my kernel wrong, or are my execution parameters wrong?

Thanks.