Code works on 16x16 thread block but not on 32x32?

I wrote a simple CUDA test program that adds the pixel values of two images. The image size is currently fixed at 352x288 pixels. At first I specified the block size as 16x16 and the grid size as 22x18, and the code worked fine. However, when I changed the block size to 32x32 and the grid size to 11x9, the code stopped working! It still compiles and runs, but the result is no longer correct. Could someone tell me what’s wrong with my code? Thank you.

The GPU code:

__global__ void
GpuAdd(unsigned char* img1, unsigned char* img2, unsigned char* result, int width, int height, int step)
{
	int bx = blockIdx.x;
	int by = blockIdx.y;
	int tx = threadIdx.x;
	int ty = threadIdx.y;

	// Global pixel coordinates and linear index into the pitched image.
	int x = bx * BLOCK_SIZE + tx;
	int y = by * BLOCK_SIZE + ty;
	int idx = y * step + x;

	// Stage the sum in shared memory, then write it back out.
	__shared__ unsigned char RESULT[BLOCK_SIZE][BLOCK_SIZE];
	RESULT[ty][tx] = img1[idx] + img2[idx];

	result[idx] = RESULT[ty][tx];
}


Calling code:

#define BLOCK_SIZE 16 // 16 works, but 32 doesn't!

dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
dim3 dimGrid(imgSize.width / dimBlock.x, imgSize.height / dimBlock.y);

GpuAdd<<<dimGrid, dimBlock>>>(d_img1, d_img2, d_res, imgSize.width, imgSize.height, d_pitch);


Device query output:

Name                    : Quadro FX 4600
Total memory            : 804978688 bytes
Max threads per block   : 512
Max block dimension     : [512 512 64]
Max grid dimension      : [65535 65535 1]
Shared memory per block : 16384 bytes
Total constant memory   : 65536 bytes
Warp size               : 32
Max pitch               : 262144 bytes

The answer to your question is very simple:

16 x 16 = 256 threads
32 x 32 = 1024 threads

With CUDA, the maximum number of threads per block on your device is 512.
It’s written in your own post: “Max threads per block : 512”.

I hope this answer is helpful ;)
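
By the way, an oversized block doesn’t crash the program: the launch simply fails and the kernel never runs, which is why it “runs” but the output is wrong. A minimal sketch of how you could catch this (using the standard runtime-API error calls; assumes <stdio.h> is included):

// An invalid configuration, e.g. 32x32 = 1024 threads on a
// 512-thread device, makes the launch fail with
// cudaErrorInvalidConfiguration; the kernel never executes.
GpuAdd<<<dimGrid, dimBlock>>>(d_img1, d_img2, d_res,
                              imgSize.width, imgSize.height, d_pitch);

cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
    printf("Kernel launch failed: %s\n", cudaGetErrorString(err));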


I see… thanks garciav!

I have another question though: when I time the GPU code, on average it runs slower than a hand-coded C program! Is there any way to improve the performance? Is it because the pixel type is unsigned char, so memory coalescing does not occur (the programming guide says: “type must be such that sizeof(type) is equal to 4, 8, or 16”)? All global memory in my code is allocated with cudaMallocPitch(), so it should fulfill the alignment requirement. If it is true that unsigned char accesses cannot be coalesced, what can I do to improve my code’s performance, considering that my application deals mainly with 1-byte-per-pixel images? Thank you.
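
For reference, this is roughly how I time the kernel (a minimal sketch using CUDA events, so it measures GPU time rather than host wall-clock time; the variables are the same as in my calling code above):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
GpuAdd<<<dimGrid, dimBlock>>>(d_img1, d_img2, d_res,
                              imgSize.width, imgSize.height, d_pitch);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);   // wait until the kernel has finished

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("GpuAdd: %f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);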

Actually, I’m a noob in CUDA ;)
Based on my long (joke) experience, I think you should switch to a type that allows coalesced reads and writes. Your code will be faster (I hope!).

I am actually not sure whether the slow performance is caused by the memory coalescing issue, or whether there are other factors I have overlooked that could improve the performance.

When I switched from non-coalesced to coalesced memory accesses, my code sped up by a factor of 8… This is really, really useful…

You’re accessing unsigned char, which is an 8-bit type, and 8-bit accesses cannot be coalesced on your hardware. The usual fix is to pack four pixels into a single 32-bit load; see the sketch below.
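
A minimal sketch of that idea: each thread handles four pixels through a uchar4, so a half-warp reads a contiguous run of 32-bit words. It assumes the image width is a multiple of 4 and the pitch from cudaMallocPitch() is 4-byte aligned (the name GpuAdd4 and the width4/step4 parameters are mine, not from your code):

__global__ void
GpuAdd4(uchar4* img1, uchar4* img2, uchar4* result, int width4, int step4)
{
    // Coordinates are in uchar4 units: each thread handles 4 pixels.
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width4)
        return;                 // guard if the grid over-covers a row

    int idx = y * step4 + x;

    // One 32-bit load per thread instead of one 8-bit load.
    uchar4 a = img1[idx];
    uchar4 b = img2[idx];

    uchar4 r;
    r.x = a.x + b.x;            // same wrapping add as the original kernel
    r.y = a.y + b.y;
    r.z = a.z + b.z;
    r.w = a.w + b.w;

    result[idx] = r;
}

The grid’s x dimension shrinks by 4, and step4 is the byte pitch divided by sizeof(uchar4). For your 352x288 image, pick a block width that divides 352/4 = 88, e.g. 8x8:

dim3 dimBlock(8, 8);
dim3 dimGrid((imgSize.width / 4) / dimBlock.x, imgSize.height / dimBlock.y);
GpuAdd4<<<dimGrid, dimBlock>>>((uchar4*)d_img1, (uchar4*)d_img2, (uchar4*)d_res,
                               imgSize.width / 4, (int)(d_pitch / 4));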