Should the number of threads in a block be a multiple of the warp size?

I have read and heard that the number of threads assigned to a block should always be a multiple of the warp size (32 on my GPU); otherwise, not only does the remaining part of the warp go unused, but performance also drops because of poor memory coalescing. So what is the best way to control/configure this when the size is user-configurable (the x-dimension size assigned to the threads in the block and grid), i.e. a screen size passed in to the CUDA device?

Thanks,
Chester

I think by trial and error. There are not many numbers between 32 and 1024 that are multiples of 32. In practice, if your code is optimal for a size N*32, it will also be optimal for (N+p)*32, with p some integer. You will just have more blocks.

Thanks for the answer. I understand the case where the size is a multiple of 32 (the warp size). But if not, say I assign the grid and block as follows and pass them to the kernel, where x_data_size is whatever the real x-dimension bitmap image size on screen is,

	m_cuda_threads = dim3(1024/32, 1, 1);
	m_cuda_grids = dim3((x_data_size + m_cuda_threads.x - 1)/m_cuda_threads.x, 1, 1);

So there is most likely a good chance that the total number of threads is not a multiple of the warp size (32). Would this degrade performance a lot? How do I eliminate that? Do I have to make x_data_size a multiple of 32?

Thanks.

Your example will still have a total number of threads that is a multiple of 32. Your block dimension will be [32 x 1 x 1], and since your grid dimension must be an integer, your total thread count will be (INT * 32), which is still a multiple of 32.

I think what you’re really getting at is handling bounds checking, and processing data whose size is not a direct multiple of the kernel launch parameters. How much this affects your performance depends on the data and your algorithm.
Some potential performance hits that immediately come to mind are warp divergence (from data-dependent branching), wasted memory (if you employ padding to handle boundary conditions), and just generally extra instructions for bounds checking (I’m sure there are others).

Sometimes it makes sense to use padding, as you alluded to in your post, but sometimes (say, for large matrices) this takes too much memory. You can use shared memory and just load subsets of the data (i.e. a tile whose size is a multiple of 32) into shared memory, and handle your bounds checking there (when you get to the last tile with fewer than 32 valid elements, just fill in the rest with identity values, e.g. 0 if you’re summing or 1 if multiplying). Or you can just check that the thread maps to a valid index, and only perform your operation if it does. You can get warp divergence from this, but as long as the conditional operations aren’t really expensive it shouldn’t be too large a problem.
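A minimal sketch of that last approach (kernel name and operation are illustrative, not from your code): each thread computes its global x index and simply skips work past the end of the row, so only the final warp of the final block can diverge.

```cuda
// Hypothetical kernel: scale one row of a bitmap whose width need not
// be a multiple of 32. Threads whose index falls past x_data_size do
// nothing; the guard costs one comparison per thread.
__global__ void scale_row(float *row, int x_data_size, float factor)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    if (x < x_data_size)      // bounds check for the padded threads
        row[x] *= factor;
}

// Host-side launch, matching the configuration discussed above:
//   dim3 threads(32, 1, 1);
//   dim3 grids((x_data_size + threads.x - 1) / threads.x, 1, 1);
//   scale_row<<<grids, threads>>>(d_row, x_data_size, 2.0f);
```

Since the divergent branch here is just a skipped multiply, the cost is negligible compared to padding the data itself.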

Hopefully you really were asking about boundary conditions…

Yes, as long as the number of threads assigned is a multiple of the warp size (32), I can bounds-check against the real data size. Thanks for the help.