I’m trying to apply a simple 3×3 blur that leaves the outer rows and columns untouched, but when I change the size of the thread blocks, the output values turn out wrong.
Host code excerpt
...
// Thread block size
#define BLOCK_SIZE 12
// Matrix dimensions
// (chosen as multiples of the thread block size for simplicity)
#define WA (3 * BLOCK_SIZE) // Matrix A width
#define HA (5 * BLOCK_SIZE) // Matrix A height
...
dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
dim3 dimGrid(HA / dimBlock.x, WA / dimBlock.y);
testKernel<<< dimGrid, dimBlock >>>(d_oA, d_iA, WA, HA);
...
Kernel
__global__ void
testKernel(float* Ablur, float* A, int wA, int hA)
{
// Block index
int bx = blockIdx.x;
int by = blockIdx.y;
// Thread index
int tx = threadIdx.x;
int ty = threadIdx.y;
int row = blockDim.y * by + ty;
int col = blockDim.x * bx + tx;
if (row == 0 || col == 0 || (row == (hA - 1)) || (col == (wA - 1)))
Ablur[row * wA + col] = A[row * wA + col];
else
Ablur[row * wA + col] = (A[(row - 1) * wA + col - 1] + A[(row - 1) * wA + col] + A[(row - 1) * wA + col + 1]
                       + A[row * wA + col - 1]       + A[row * wA + col]       + A[row * wA + col + 1]
                       + A[(row + 1) * wA + col - 1] + A[(row + 1) * wA + col] + A[(row + 1) * wA + col + 1]) / 9;
}
It works with “#define BLOCK_SIZE 4” and a very large matrix, so the problem seems to be related to the block size. Is the problem in my kernel or in my execution parameters?
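In case it helps, the mapping I intended is that the grid’s x dimension covers the matrix width (columns) and y covers the height (rows). Written out defensively, with ceiling division and an in-kernel bounds check, the launch would look roughly like this (just a sketch using the names from the excerpt above, not the code I’m actually running):

// Grid sized by ceiling division so partial blocks are still covered;
// x covers columns (width), y covers rows (height).
dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
dim3 dimGrid((WA + dimBlock.x - 1) / dimBlock.x,
             (HA + dimBlock.y - 1) / dimBlock.y);
testKernel<<< dimGrid, dimBlock >>>(d_oA, d_iA, WA, HA);

and at the top of the kernel, a guard for threads that fall outside the matrix:

if (row >= hA || col >= wA)
    return;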
You’re right; I checked the maximum number of threads per block with the deviceQuery sample. Is there any workaround?
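For reference, the limits that deviceQuery prints can also be read at runtime through the CUDA API, so a block size can be validated programmatically before launching. A minimal sketch, assuming device 0:

#include <stdio.h>

void printDeviceLimits(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // query device 0
    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("max block dims: %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    // A square 2D block is only legal if
    // BLOCK_SIZE * BLOCK_SIZE <= prop.maxThreadsPerBlock
}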
By the way, if I use these settings
// Thread block size
#define BLOCK_SIZE 22
// Matrix dimensions
// (chosen as multiples of the thread block size for simplicity)
#define WA (128 * BLOCK_SIZE) // Matrix A width
#define HA (128 * BLOCK_SIZE) // Matrix A height
and
dim3 dimGrid(HA / dimBlock.x, WA / dimBlock.y);
and the host algorithm (which I thought was equivalent to the device version)
void computeGold(float* Ablur, const float* A, const unsigned int hA, const unsigned int wA)
{
for (unsigned int i = 0; i < hA; i++)
{
for (unsigned int j = 0; j < wA; j++)
{
if (i == 0 || j == 0 || (i == (hA - 1)) || (j == (wA - 1)))
Ablur[i * wA + j] = A[i * wA + j];
else
Ablur[i * wA + j] = (A[(i - 1) * wA + j - 1] + A[(i - 1) * wA + j] + A[(i - 1) * wA + j + 1]
                   + A[i * wA + j - 1]       + A[i * wA + j]       + A[i * wA + j + 1]
                   + A[(i + 1) * wA + j - 1] + A[(i + 1) * wA + j] + A[(i + 1) * wA + j + 1]) / 9;
}
}
}
I get 776.938354 ms and 381.094269 ms for the GPU and CPU processing times, respectively. Is this what I should have expected?
Check out the Box Filter example in the SDK. Its implementation exploits the fact that the filter is separable, processing the rows and the columns in separate passes so that memory reads can be coalesced (at least in the row pass). It also uses some clever tricks to reduce the number of operations required to calculate each element.
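To illustrate just the separability idea (a rough sketch, not the SDK’s implementation): a 3×3 box blur can be split into a horizontal pass that writes into a temporary buffer and a vertical pass that reads it back, so each element needs about 6 reads instead of 9. Note that with the simple copy-through border used here, the top and bottom rows end up horizontally blurred, which differs slightly at the edges from the one-pass kernel above.

__global__ void rowBlur(float* out, const float* in, int wA, int hA)
{
    int row = blockDim.y * blockIdx.y + threadIdx.y;
    int col = blockDim.x * blockIdx.x + threadIdx.x;
    if (row >= hA || col >= wA) return;
    if (col == 0 || col == wA - 1)          // copy the border columns through
        out[row * wA + col] = in[row * wA + col];
    else                                    // average three horizontal neighbours
        out[row * wA + col] = (in[row * wA + col - 1]
                             + in[row * wA + col]
                             + in[row * wA + col + 1]) / 3.0f;
}

__global__ void colBlur(float* out, const float* in, int wA, int hA)
{
    int row = blockDim.y * blockIdx.y + threadIdx.y;
    int col = blockDim.x * blockIdx.x + threadIdx.x;
    if (row >= hA || col >= wA) return;
    if (row == 0 || row == hA - 1)          // copy the border rows through
        out[row * wA + col] = in[row * wA + col];
    else                                    // average three vertical neighbours
        out[row * wA + col] = (in[(row - 1) * wA + col]
                             + in[row * wA + col]
                             + in[(row + 1) * wA + col]) / 3.0f;
}

The two kernels would be launched back to back with a scratch buffer in between, e.g. rowBlur from the input into a temporary d_tmp, then colBlur from d_tmp into the output.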
Also, you may want to look at Chapter 5, “Performance Guidelines”, in the CUDA Programming Guide. There are tons of tips on how to optimize your memory usage and execution configuration.