Should the number of threads in a block be a multiple of the warp size?

I have read and heard that the number of threads assigned to a block should always be a multiple of the warp size (32 on my GPU); otherwise, not only does the remaining part of the warp go unused, but performance also drops because of poor memory coalescing. So what is the best way to control/configure this when the size is user-configurable (the x-dimension size assigned to the threads in the block and grid), i.e. a screen size passed in to the CUDA device?

Thanks,
Chester

I think by trial and error. There are not many numbers between 32 and 1024 that are multiples of 32. In practice, if your code is optimal for a size N*32, it will also be optimal for (N+p)*32, with p some integer. You will just have more blocks.

Thanks for the answer. I understand the case where the size is a multiple of 32 (the warp size). But if not, say I assign the grid and block as follows and pass them to the kernel, where x_data_size is whatever the real x-dimension bitmap image size on screen is,

	m_cuda_threads = dim3(1024/32, 1, 1);
	m_cuda_grids = dim3((x_data_size + m_cuda_threads.x - 1)/m_cuda_threads.x, 1, 1);

So there is most likely a good chance that the total number of threads is not a multiple of the warp size (32). Would this degrade performance a lot? How do I eliminate that? Do I have to make x_data_size a multiple of 32?

Thanks.

Your example will still have a total number of threads that is a multiple of 32. Your block dimension will be [32 x 1 x 1], and since your grid dimension must be an integer, your total thread count will be (INT * 32), which is still a multiple of 32.

I think what you’re really getting at is handling bounds checking, and processing data whose size is not a direct multiple of the kernel launch parameters. How much this affects your performance depends on the data and your algorithm.
Some potential performance hits that immediately come to mind are warp divergence (from data-dependent branching), wasted memory (if you employ padding to handle boundary conditions), and just generally extra instructions for bounds checking (I’m sure there are others).

Sometimes it makes sense to use padding, as you alluded to in your post, but sometimes (say, for large matrices) this takes too much memory. You can use shared memory and just load subsets of the data (i.e. a tile whose size is a multiple of 32) into shared memory, and handle your bounds checking there (when you get to the last tile with fewer than 32 valid elements, just fill in the rest with identity values, e.g. 0 if you’re summing or 1 if multiplying). Or you can just check that the thread maps to a valid index, and only perform your operation if it does. You can get warp divergence from this, but as long as the conditional operations aren’t really expensive it shouldn’t be too large a problem.
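A minimal sketch of that last approach (kernel name and operation are illustrative, not from your code): each thread computes its global x index and simply skips work past the end of the row, so only the final warp of the final block can diverge.

```cuda
// Hypothetical kernel: scale one row of a bitmap whose width need not
// be a multiple of 32. Threads whose index falls past x_data_size do
// nothing; the guard costs one comparison per thread.
__global__ void scale_row(float *row, int x_data_size, float factor)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    if (x < x_data_size)      // bounds check for the padded threads
        row[x] *= factor;
}

// Host-side launch, matching the configuration discussed above:
//   dim3 threads(32, 1, 1);
//   dim3 grids((x_data_size + threads.x - 1) / threads.x, 1, 1);
//   scale_row<<<grids, threads>>>(d_row, x_data_size, 2.0f);
```

Since the divergent branch here is just a skipped multiply, the cost is negligible compared to padding the data itself.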

Hopefully you really were asking about boundary conditions…

Yes, as long as the number of threads assigned is a multiple of the warp size (32), I can bounds-check against the real data size. Thanks for the help.