I’m trying to apply a simple 3×3 blur that leaves the outer rows and columns untouched, but when I change the size of the thread blocks, the output values turn out wrong.
Host code excerpt
...
// Thread block size
#define BLOCK_SIZE 12
// Matrix dimensions
// (chosen as multiples of the thread block size for simplicity)
#define WA (3 * BLOCK_SIZE) // Matrix A width
#define HA (5 * BLOCK_SIZE) // Matrix A height
...
dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
dim3 dimGrid(HA / dimBlock.x, WA / dimBlock.y);
testKernel<<< dimGrid, dimBlock >>>(d_oA, d_iA, WA, HA);
...
Kernel
__global__ void
testKernel(float* Ablur, float* A, int wA, int hA)
{
// Block index
int bx = blockIdx.x;
int by = blockIdx.y;
// Thread index
int tx = threadIdx.x;
int ty = threadIdx.y;
int row = blockDim.y * by + ty;
int col = blockDim.x * bx + tx;
if (row == 0 || col == 0 || (row == (hA - 1)) || (col == (wA - 1)))
Ablur[row * wA + col] = A[row * wA + col];
else
Ablur[row * wA + col] = (A[(row - 1) * wA + col - 1] + A[(row - 1) * wA + col] + A[(row - 1) * wA + col + 1]
                       + A[row * wA + col - 1]       + A[row * wA + col]       + A[row * wA + col + 1]
                       + A[(row + 1) * wA + col - 1] + A[(row + 1) * wA + col] + A[(row + 1) * wA + col + 1]) / 9;
}
It works with “#define BLOCK_SIZE 4” and a very large matrix, so the problem seems to be related to the block size. Is the problem in my kernel or in my execution parameters?
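In case it helps, the mapping I intended is that the grid’s x dimension covers the matrix width (columns) and y covers the height (rows). Written out defensively, with ceiling division and an in-kernel bounds check, the launch would look roughly like this (just a sketch using the names from the excerpt above, not the code I’m actually running):

// Grid sized by ceiling division so partial blocks are still covered;
// x covers columns (width), y covers rows (height).
dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
dim3 dimGrid((WA + dimBlock.x - 1) / dimBlock.x,
             (HA + dimBlock.y - 1) / dimBlock.y);
testKernel<<< dimGrid, dimBlock >>>(d_oA, d_iA, WA, HA);

and at the top of the kernel, a guard for threads that fall outside the matrix:

if (row >= hA || col >= wA)
    return;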
You’re right; I checked the maximum number of threads per block with the deviceQuery sample. Is there any workaround?
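For reference, the limits that deviceQuery prints can also be read at runtime through the CUDA API, so a block size can be validated programmatically before launching. A minimal sketch, assuming device 0:

#include <stdio.h>

void printDeviceLimits(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // query device 0
    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("max block dims: %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    // A square 2D block is only legal if
    // BLOCK_SIZE * BLOCK_SIZE <= prop.maxThreadsPerBlock
}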
By the way, if I use these settings
// Thread block size
#define BLOCK_SIZE 22
// Matrix dimensions
// (chosen as multiples of the thread block size for simplicity)
#define WA (128 * BLOCK_SIZE) // Matrix A width
#define HA (128 * BLOCK_SIZE) // Matrix A height
and
dim3 dimGrid(HA / dimBlock.x, WA / dimBlock.y);
and the host algorithm (which I thought was equivalent to the device version)
void computeGold(float* Ablur, const float* A, const unsigned int hA, const unsigned int wA)
{
for (unsigned int i = 0; i < hA; i++)
{
for (unsigned int j = 0; j < wA; j++)
{
if (i == 0 || j == 0 || (i == (hA - 1)) || (j == (wA - 1)))
Ablur[i * wA + j] = A[i * wA + j];
else
Ablur[i * wA + j] = (A[(i - 1) * wA + j - 1] + A[(i - 1) * wA + j] + A[(i - 1) * wA + j + 1]
                   + A[i * wA + j - 1]       + A[i * wA + j]       + A[i * wA + j + 1]
                   + A[(i + 1) * wA + j - 1] + A[(i + 1) * wA + j] + A[(i + 1) * wA + j + 1]) / 9;
}
}
}
I get 776.938354 ms and 381.094269 ms for the GPU and CPU processing times, respectively. Is this what I should have expected?
Check out the Box Filter example in the SDK. Its implementation exploits the fact that the filter is separable, processing the rows and the columns in separate passes so that memory reads can be coalesced (at least in the row pass). It also uses some clever tricks to reduce the number of operations required to calculate each element.
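To illustrate just the separability idea (a rough sketch, not the SDK’s implementation): a 3×3 box blur can be split into a horizontal pass that writes into a temporary buffer and a vertical pass that reads it back, so each element needs about 6 reads instead of 9. Note that with the simple copy-through border used here, the top and bottom rows end up horizontally blurred, which differs slightly at the edges from the one-pass kernel above.

__global__ void rowBlur(float* out, const float* in, int wA, int hA)
{
    int row = blockDim.y * blockIdx.y + threadIdx.y;
    int col = blockDim.x * blockIdx.x + threadIdx.x;
    if (row >= hA || col >= wA) return;
    if (col == 0 || col == wA - 1)          // copy the border columns through
        out[row * wA + col] = in[row * wA + col];
    else                                    // average three horizontal neighbours
        out[row * wA + col] = (in[row * wA + col - 1]
                             + in[row * wA + col]
                             + in[row * wA + col + 1]) / 3.0f;
}

__global__ void colBlur(float* out, const float* in, int wA, int hA)
{
    int row = blockDim.y * blockIdx.y + threadIdx.y;
    int col = blockDim.x * blockIdx.x + threadIdx.x;
    if (row >= hA || col >= wA) return;
    if (row == 0 || row == hA - 1)          // copy the border rows through
        out[row * wA + col] = in[row * wA + col];
    else                                    // average three vertical neighbours
        out[row * wA + col] = (in[(row - 1) * wA + col]
                             + in[row * wA + col]
                             + in[(row + 1) * wA + col]) / 3.0f;
}

The two kernels would be launched back to back with a scratch buffer in between, e.g. rowBlur from the input into a temporary d_tmp, then colBlur from d_tmp into the output.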
Also, you may want to look at Chapter 5, “Performance Guidelines”, in the CUDA Programming Guide. There are tons of tips on how to optimize your memory usage and execution configuration.