Code works on 16x16 thread block but not on 32x32?

I wrote a simple CUDA test program that adds the pixel values of two images. The image size is currently fixed at 352x288 pixels. At first I specified the block size as 16x16 and the grid size as 22x18, and the code worked fine. However, when I changed the block size to 32x32 and the grid size to 11x9, the code stopped working! It still compiles and runs, but the result is no longer correct. Could someone tell me what’s wrong with my code? Thank you.

The GPU code:

__global__ void
GpuAdd(unsigned char* img1, unsigned char* img2, unsigned char* result, int width, int height, int step)
{
	int bx = blockIdx.x;
	int by = blockIdx.y;
	int tx = threadIdx.x;
	int ty = threadIdx.y;

	// Global pixel coordinates and linear index into the pitched image.
	int x = bx * BLOCK_SIZE + tx;
	int y = by * BLOCK_SIZE + ty;
	int idx = y * step + x;

	// Stage the sum in shared memory, then write it back out.
	__shared__ unsigned char RESULT[BLOCK_SIZE][BLOCK_SIZE];
	RESULT[ty][tx] = img1[idx] + img2[idx];

	result[idx] = RESULT[ty][tx];
}


Calling code:

#define BLOCK_SIZE 16 // 16 works, but 32 doesn't!

dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
dim3 dimGrid(imgSize.width / dimBlock.x, imgSize.height / dimBlock.y);

GpuAdd<<<dimGrid, dimBlock>>>(d_img1, d_img2, d_res, imgSize.width, imgSize.height, d_pitch);


Device query output:

Name                    : Quadro FX 4600
Total memory            : 804978688 bytes
Max threads per block   : 512
Max block dimension     : [512 512 64]
Max grid dimension      : [65535 65535 1]
Shared memory per block : 16384 bytes
Total constant memory   : 65536 bytes
Warp size               : 32
Max pitch               : 262144 bytes

The answer to your question is very simple:

16 x 16 = 256 threads
32 x 32 = 1024 threads

With CUDA, the maximum number of threads per block on your device is 512.
It’s written in your own post: “Max threads per block : 512”.

I hope this answer is helpful ;)
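
By the way, an oversized block doesn’t crash the program: the launch simply fails and the kernel never runs, which is why it “runs” but the output is wrong. A minimal sketch of how you could catch this (using the standard runtime-API error calls; assumes <stdio.h> is included):

// An invalid configuration, e.g. 32x32 = 1024 threads on a
// 512-thread device, makes the launch fail with
// cudaErrorInvalidConfiguration; the kernel never executes.
GpuAdd<<<dimGrid, dimBlock>>>(d_img1, d_img2, d_res,
                              imgSize.width, imgSize.height, d_pitch);

cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
    printf("Kernel launch failed: %s\n", cudaGetErrorString(err));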


I see… thanks garciav!

I have another question though: when I time the GPU code, on average it runs slower than a hand-coded C program! Is there any way to improve the performance? Is it because the pixel type is unsigned char, so memory coalescing does not occur (the programming guide says: “type must be such that sizeof(type) is equal to 4, 8, or 16”)? All global memory in my code is allocated with cudaMallocPitch(), so it should fulfill the alignment requirement. If it is true that unsigned char accesses cannot be coalesced, what can I do to improve my code’s performance, considering that my application deals mainly with 1-byte-per-pixel images? Thank you.
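
For reference, this is roughly how I time the kernel (a minimal sketch using CUDA events, so it measures GPU time rather than host wall-clock time; the variables are the same as in my calling code above):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
GpuAdd<<<dimGrid, dimBlock>>>(d_img1, d_img2, d_res,
                              imgSize.width, imgSize.height, d_pitch);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);   // wait until the kernel has finished

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("GpuAdd: %f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);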

Actually, I’m a noob in CUDA ;)
Based on my long (joke) experience, I think you should switch to a type that allows coalesced reads and writes. Your code will be faster (I hope!).

I am actually not sure whether the slow performance is caused by the memory coalescing issue, or whether there are other factors I have overlooked that could improve the performance.

When I switched from non-coalesced to coalesced memory accesses, my code sped up by a factor of 8… This is really, really useful…

You’re accessing unsigned char, which is an 8-bit type, and 8-bit accesses cannot be coalesced on your hardware. The usual fix is to pack four pixels into a single 32-bit load; see the sketch below.
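
A minimal sketch of that idea: each thread handles four pixels through a uchar4, so a half-warp reads a contiguous run of 32-bit words. It assumes the image width is a multiple of 4 and the pitch from cudaMallocPitch() is 4-byte aligned (the name GpuAdd4 and the width4/step4 parameters are mine, not from your code):

__global__ void
GpuAdd4(uchar4* img1, uchar4* img2, uchar4* result, int width4, int step4)
{
    // Coordinates are in uchar4 units: each thread handles 4 pixels.
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width4)
        return;                 // guard if the grid over-covers a row

    int idx = y * step4 + x;

    // One 32-bit load per thread instead of one 8-bit load.
    uchar4 a = img1[idx];
    uchar4 b = img2[idx];

    uchar4 r;
    r.x = a.x + b.x;            // same wrapping add as the original kernel
    r.y = a.y + b.y;
    r.z = a.z + b.z;
    r.w = a.w + b.w;

    result[idx] = r;
}

The grid’s x dimension shrinks by 4, and step4 is the byte pitch divided by sizeof(uchar4). For your 352x288 image, pick a block width that divides 352/4 = 88, e.g. 8x8:

dim3 dimBlock(8, 8);
dim3 dimGrid((imgSize.width / 4) / dimBlock.x, imgSize.height / dimBlock.y);
GpuAdd4<<<dimGrid, dimBlock>>>((uchar4*)d_img1, (uchar4*)d_img2, (uchar4*)d_res,
                               imgSize.width / 4, (int)(d_pitch / 4));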