Would the block dimension influence performance? 1D block vs 2D block?

timothy · October 3, 2008, 6:48pm

Hi there,

It might be a stupid question, but I still want to ask: does the block dimension influence performance?

For example, I want to compute the colors for all vertices in a scene. I put the information (positions, normals…) for all vertices in a 2D texture, then assign one CUDA thread to each vertex. I need to put a certain number of threads in one block, and this block can be 1D, 2D or 3D. The number of blocks I need is simply (number of vertices) / (number of threads in one block), rounded up. I am wondering whether the block dimension will influence performance?
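
Editor's sketch: the block-count arithmetic in the question can be written out as below. This is plain host-side C++ rather than a CUDA kernel, so it runs without a GPU. The names `blocks_needed` and `thread_is_active` are hypothetical and not from the original post. It assumes one thread per vertex and rounds the division up so that a partly filled last block is still launched.

```cpp
#include <cstddef>

// Hypothetical helper: number of blocks needed so that every vertex gets
// one thread. The division is rounded up, so the last block may be only
// partly used.
std::size_t blocks_needed(std::size_t num_vertices,
                          std::size_t threads_per_block) {
    return (num_vertices + threads_per_block - 1) / threads_per_block;
}

// Hypothetical guard a kernel would typically apply: threads whose global
// index falls past the last vertex should do no work.
bool thread_is_active(std::size_t global_thread_id,
                      std::size_t num_vertices) {
    return global_thread_id < num_vertices;
}
```

For instance, 1000 vertices with 256 threads per block need 4 blocks, and the last 24 threads of the fourth block fail the guard.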

Thanks!

-timothy

jack · October 3, 2008, 7:09pm

Well, ideally, you want your thread count to be as high as your device allows (determined by the number of registers available divided by the number of registers your kernel uses). Typically, though, the thread count per block is set to 128 or 256 so that the code stays compatible across different device types.

So, basically, yes, the number of threads in your block has an effect on the performance of your program.

etude · October 3, 2008, 7:47pm

I guess the question was whether the dimensionality of the block, for a given number of threads per block, affects overall performance. My opinion is yes, as the dimensionality can simplify or complicate the memory addressing calculation.

jack · October 3, 2008, 8:00pm

I would think that the answer is no, since the compiler translates any ‘2D’ and ‘3D’ calls in the code to the relevant 1D call (since that is how the memory is laid out). The only performance hit might be if you are determining your block size dynamically, where the compiler would have to add extra instructions into the code to handle the dynamic block size.

_Big_Mac · October 3, 2008, 8:02pm

Not always. For example, on pre-G200 architectures each SM can process 768 threads simultaneously and the maximum number of threads in a block is 512. Since a block cannot be split among SMs (I believe because of synchronization), with 512-thread blocks you will never run more than one block per SM, and so you use only 512 of the 768 threads. This may cost you performance (but won’t in all cases).

So your block dimensions should be such that the number of threads per block is a divisor of the number of threads an SM can process. Block size can also affect performance in some cases where threads extensively share data (shared memory), but I can’t think of a problem where more wouldn’t equal better in such a case. In fact, I can imagine instances where you would want to have 512 threads per block in such situations (contrary to what I wrote above about the divisor).

You should avoid small blocks. Smaller than 32 threads and you’re getting below warp size - bad! 64 threads per block is the minimum (and good only if you can execute many concurrent blocks, i.e. good occupancy). 128-256 threads per block tends to be a comfy zone for many applications and that’s where I suggest you start, but in general it’s not trivial to optimize block dimensions. There’s usually experimentation involved (sometimes a lot), so you should try different sizes. Remember to try to keep them multiples of 32 if you can.
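
Editor's sketch: the 768-thread arithmetic above can be checked with a few lines of plain host-side C++. The function names are hypothetical. The per-SM thread limit of 768 comes from the post, while the per-SM limit of 8 resident blocks is an assumption added here for illustration. Register and shared-memory limits, which also constrain occupancy, are ignored.

```cpp
// Hypothetical helper: how many blocks of a given size can be resident on
// one SM when only the per-SM thread limit and the per-SM block limit are
// considered. Register and shared-memory limits are NOT modeled.
int resident_blocks(int threads_per_block, int max_threads_per_sm,
                    int max_blocks_per_sm) {
    int by_threads = max_threads_per_sm / threads_per_block;
    return by_threads < max_blocks_per_sm ? by_threads : max_blocks_per_sm;
}

// Threads actually resident on the SM for that block size.
int resident_threads(int threads_per_block, int max_threads_per_sm,
                     int max_blocks_per_sm) {
    return resident_blocks(threads_per_block, max_threads_per_sm,
                           max_blocks_per_sm) * threads_per_block;
}

// Warps a block occupies; a partly filled warp still counts as a whole one.
int warps_per_block(int threads_per_block, int warp_size = 32) {
    return (threads_per_block + warp_size - 1) / warp_size;
}
```

Under these assumed limits, 512-thread blocks leave 512 of 768 thread slots filled, 128- and 256-thread blocks fill all 768, and 64-thread blocks fall back to 512 because the assumed block cap is reached first. A 48-thread block still occupies 2 warps, which is why multiples of 32 are suggested.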

The “spatial” configuration (1D vs 2D vs 3D) is mostly for the programmer’s convenience and not that critical, but perhaps more efficient addressing can save you a few registers? Maybe it also affects texture caching? I don’t know. In most situations you’ll probably want to map the dimensionality to the problem. If you’re doing computations on a 2D array of texels, the obvious solution of 2D blocks seems the most appropriate.

EDIT:

That’s how it works for linear global memory, textures are different.

etude · October 3, 2008, 8:18pm

I don’t think it depends only on whether the block size is dynamic. An example from image processing: if each block handles a sub-image, a 2D block gives direct addressing, blockbase + (threadIdx.x + width*threadIdx.y), while a 1D block needs a conversion from threadIdx.x to a 2D address within the sub-image: blockbase + (threadIdx.x / sub_width)*width + threadIdx.x % sub_width. This is simply because sub_width is implied by blockDim.x in the former case and has to be supplied explicitly in the latter.

In other words, a block dimensionality that fits the underlying data model gives a more efficient addressing calculation.
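
Editor's sketch: the two addressing formulas from this post, written as plain host-side C++ so they can be compared directly. The function names and the parameters `tx`, `ty` and `t` (standing in for `threadIdx.x`, `threadIdx.y`, and the 1D `threadIdx.x`) are hypothetical. `width` is the full image row pitch in elements and `sub_width` is the sub-image width, as in the post.

```cpp
// 2D block: (tx, ty) index directly into the sub-image.
int address_2d(int blockbase, int tx, int ty, int width) {
    return blockbase + tx + width * ty;
}

// 1D block: the flat thread index t must first be split back into a row
// (t / sub_width) and a column (t % sub_width) within the sub-image.
int address_1d(int blockbase, int t, int sub_width, int width) {
    return blockbase + (t / sub_width) * width + t % sub_width;
}
```

For a 16-wide sub-image, thread (3, 2) of a 2D block and thread 35 of a 1D block reach the same element. The 1D form pays for an extra integer division and modulo, which is the overhead being described.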

timothy · October 6, 2008, 6:01pm

Thanks for your replies, guys!
I tested both cases, and there is almost no performance difference in my program.

-timothy

alex_dubinsky · October 7, 2008, 1:25am

On its own, there won’t be a performance difference. BUT… having 1D blocks often makes it easier for me to think about things like coalescing and divergence. That’s something to keep in mind.
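
Editor's sketch: one way to reason about coalescing and divergence for a multi-dimensional block is to flatten each thread index into a linear ID, with x varying fastest, and see which warp it lands in. The plain host-side C++ below does that. The function names are hypothetical, and the warp size of 32 is taken from the discussion above.

```cpp
// Linear thread ID within a block: x varies fastest, then y, then z.
int linear_thread_id(int tx, int ty, int tz, int dim_x, int dim_y) {
    return tx + ty * dim_x + tz * dim_x * dim_y;
}

// Warp that a given linear thread ID belongs to.
int warp_of(int linear_id, int warp_size = 32) {
    return linear_id / warp_size;
}
```

In a 16x16 block, threads (0,0) through (15,1) share warp 0, so each warp covers two rows of 16 threads. In a 256x1 block, warp 0 is simply threads 0 to 31, which is the simpler picture a 1D block gives.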

P.S. NO, you don’t need as many threads as possible! Sometimes it helps to have max threads. Usually you only need 256. Frequently you’re better off using more registers per thread to accelerate your program, even if that means having fewer threads.
