Blur in 1D array

I’m trying to apply a simple blur algorithm that leaves the outermost rows and columns unchanged, but when I change the size of the thread blocks, the output values come out wrong.

Host code excerpt

...
// Thread block size
#define BLOCK_SIZE 12

// Matrix dimensions
// (chosen as multiples of the thread block size for simplicity)
#define WA (3 * BLOCK_SIZE) // Matrix A width
#define HA (5 * BLOCK_SIZE) // Matrix A height
...
dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
dim3 dimGrid(HA / dimBlock.x, WA / dimBlock.y);
testKernel<<< dimGrid, dimBlock >>>(d_oA, d_iA, WA, HA);
...

Kernel

__global__ void
testKernel(float* Ablur, float* A, int wA, int hA)
{
	void checkCUDAError(const char* msg);

	// Block index
	int bx = blockIdx.x;
	int by = blockIdx.y;

	// Thread index
	int tx = threadIdx.x;
	int ty = threadIdx.y;

	int row = blockDim.y * by + ty;
	int col = blockDim.x * bx + tx;

	if (row == 0 || col == 0 || row == (hA - 1) || col == (wA - 1))
		Ablur[row * wA + col] = A[row * wA + col];
	else
		Ablur[row * wA + col] = (A[row * wA + col]
			+ A[row * wA + col + 1]   + A[row * wA + col - 1]
			+ A[(row + 1) * wA + col] + A[(row + 1) * wA + col + 1] + A[(row + 1) * wA + col - 1]
			+ A[(row - 1) * wA + col] + A[(row - 1) * wA + col + 1] + A[(row - 1) * wA + col - 1]) / 9;
}

It works with “#define BLOCK_SIZE 4” and a very large matrix, so the problem has to do with the block size. Is the problem in my kernel or in my execution parameters?

Thanks.

Whoops, I had switched the rows with the columns here:

int row = blockDim.y * by + ty;
int col = blockDim.x * bx + tx;

Now it works fine with BLOCK_SIZE = 12, but when I change it to 23 or more, I get this error:

cutilCheckMsg() CUTIL CUDA error: Kernel execution failed in file <blur.cu>, line 107 : invalid configuration argument.

Can someone help me?

Thanks.

You can only have 512 threads per block (I think), and 23 * 23 = 529.

Cheers!

You’re right; I checked my maximum number of threads per block with the Device Query sample. Is there any work-around?

By the way, if I use these settings

// Thread block size
#define BLOCK_SIZE 22

// Matrix dimensions
// (chosen as multiples of the thread block size for simplicity)
#define WA (128 * BLOCK_SIZE) // Matrix A width
#define HA (128 * BLOCK_SIZE) // Matrix A height

and

dim3 dimGrid(HA / dimBlock.x, WA / dimBlock.y);

and the host algorithm (which I thought was equivalent to the device version)

void computeGold(float* Ablur, const float* A, const unsigned int hA, const unsigned int wA)
{
	for (unsigned int i = 0; i < hA; i++)
	{
		for (unsigned int j = 0; j < wA; j++)
		{
			if (i == 0 || j == 0 || i == (hA - 1) || j == (wA - 1))
				Ablur[i * wA + j] = A[i * wA + j];
			else
				Ablur[i * wA + j] = (A[i * wA + j]
					+ A[i * wA + j + 1]   + A[i * wA + j - 1]
					+ A[(i + 1) * wA + j] + A[(i + 1) * wA + j + 1] + A[(i + 1) * wA + j - 1]
					+ A[(i - 1) * wA + j] + A[(i - 1) * wA + j + 1] + A[(i - 1) * wA + j - 1]) / 9;
		}
	}
}

the GPU takes 776.938354 ms and the CPU 381.094269 ms. Is this what I should have expected?

Check out the Box Filter example in the SDK. Their implementation exploits the fact that the filter is separable, processing the rows and columns in separate passes to allow coalesced memory reads (at least when processing the rows). They also use some clever tricks to reduce the number of operations needed per element.

Also, you may want to look at Chapter 5 - Performance Guidelines in the CUDA programming guide. There are tons of tips on how to optimize your memory usage and execution configuration.

Hope that helps!