I wrote a simple CUDA test program that adds the pixel values of two images. The image size is currently fixed at 352x288 pixels. At first I specified the block size as 16x16 and the grid size as 22x18, and the code worked fine. However, when I changed the block size to 32x32 and the grid size to 11x9, the code no longer worked: it still compiles and runs, but the result is not correct. Could someone tell me what's wrong with my code? Thank you.
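For reference, both launch configurations tile the 352x288 image exactly:

16 x 22 = 352 and 16 x 18 = 288 (block 16x16, grid 22x18)
32 x 11 = 352 and 32 x 9 = 288 (block 32x32, grid 11x9)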
The GPU code:
__global__ void
GpuAdd(unsigned char* img1, unsigned char* img2, unsigned char* result, int width, int height, int step)
{
    int bx = blockIdx.x;
    int by = blockIdx.y;
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Global pixel coordinates and the byte offset into the pitched image
    // (step is the row pitch passed in from the host).
    int x = bx * BLOCK_SIZE + tx;
    int y = by * BLOCK_SIZE + ty;
    int idx = y * step + x;

    // Stage the per-pixel sums in shared memory, then write them back out.
    __shared__ unsigned char RESULT[BLOCK_SIZE][BLOCK_SIZE];
    RESULT[ty][tx] = img1[idx] + img2[idx];
    __syncthreads();
    result[idx] = RESULT[ty][tx];
}
Calling code:
#define BLOCK_SIZE 16 //16 works, but 32 doesn't!
...
dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
dim3 dimGrid(imgSize.width / dimBlock.x, imgSize.height / dimBlock.y);
GpuAdd<<<dimGrid, dimBlock>>>(d_img1, d_img2, d_res, imgSize.width, imgSize.height, d_pitch);
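I have not added error checking after the launch in the code above. A minimal sketch of what I could add right after the GpuAdd<<<...>>> call (assuming <cstdio> is included; h_res is a hypothetical host buffer for the result image):

cudaError_t err = cudaGetLastError(); // reports launch failures such as an invalid configuration
if (err != cudaSuccess)
    printf("kernel launch failed: %s\n", cudaGetErrorString(err));

// Copy the pitched result back to the host and check that call as well.
err = cudaMemcpy2D(h_res, imgSize.width, d_res, d_pitch,
                   imgSize.width, imgSize.height, cudaMemcpyDeviceToHost);
if (err != cudaSuccess)
    printf("copy-back failed: %s\n", cudaGetErrorString(err));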
My GPU:
Name : Quadro FX 4600
Total memory : 804978688 bytes
Max threads per block : 512
Max block dimension : [512 512 64]
Max grid dimension : [65535 65535 1]
Shared memory per block : 16384 bytes
Total constant memory : 65536 bytes
Warp size : 32
Max pitch : 262144
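In case it helps, here is a rough sketch of how the block shape could be validated against the limits listed above before launching (blockFitsDevice is my own helper name, and I'm assuming the runtime API's cudaGetDeviceProperties):

#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: check a block shape against the device limits
// (maxThreadsPerBlock and maxThreadsDim) before launching a kernel.
bool blockFitsDevice(dim3 block, int device = 0)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);

    int threadsPerBlock = (int)(block.x * block.y * block.z);
    if (threadsPerBlock > prop.maxThreadsPerBlock ||
        (int)block.x > prop.maxThreadsDim[0] ||
        (int)block.y > prop.maxThreadsDim[1] ||
        (int)block.z > prop.maxThreadsDim[2])
    {
        printf("Block %ux%ux%u exceeds the device limits\n", block.x, block.y, block.z);
        return false;
    }
    return true;
}

// Usage before the launch: if (blockFitsDevice(dimBlock)) GpuAdd<<<dimGrid, dimBlock>>>(...);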