How to determine the Block Size

Hi all,

I’m wondering how to choose the block size. What strategy should generally be adopted when programming a GPU?

Let us suppose that I want to handle generic input sizes and exploit the computational capabilities of the GPU hardware as fully as possible. Accordingly, in my opinion, I should use the maximum number of threads per block allowed and avoid branches or conditional expressions in kernels, so that truly parallel execution can be achieved.

So if the maximum number of threads per block is 512, maybe a block size of 32x16 or 16x32 would be the right choice. Since I fix the block size and want to avoid if statements in kernels, I have to pad the data structures involved so that their dimensions are integer multiples of the block. In addition, I thought that the larger the data structures, the smaller the relative cost of padding, since in the worst case a structure is padded with 31 extra rows and 15 extra columns (or equivalently 15 rows and 31 columns).
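To be concrete, this is roughly what I do now (scaleKernel and the sizes are just placeholders for my real code):

#include <cuda_runtime.h>

// Placeholder kernel: every thread touches one element with no bounds
// test, because the padded allocation guarantees all indices are valid.
__global__ void scaleKernel(float *data, int pitch)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    data[row * pitch + col] *= 2.0f;
}

void launch(int N, int M)   // N rows, M columns of real data
{
    const int BLOCK_X = 32, BLOCK_Y = 16;   // 32*16 = 512 threads per block

    // Round each dimension up to the next multiple of the block dimension.
    int paddedCols = ((M + BLOCK_X - 1) / BLOCK_X) * BLOCK_X;
    int paddedRows = ((N + BLOCK_Y - 1) / BLOCK_Y) * BLOCK_Y;

    float *d_data;
    cudaMalloc((void **)&d_data, paddedRows * paddedCols * sizeof(float));
    cudaMemset(d_data, 0, paddedRows * paddedCols * sizeof(float)); // zero everything, padding included

    dim3 block(BLOCK_X, BLOCK_Y);
    dim3 grid(paddedCols / BLOCK_X, paddedRows / BLOCK_Y);
    scaleKernel<<<grid, block>>>(d_data, paddedCols);

    cudaFree(d_data);
}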

I adopted this strategy in writing my code, but I found that the best performance is attained with an 8x8 block. I understand that there is a trade-off in the block size choice that also involves memory transfers between host and device, and that the best block size depends on the specific code, so no single value is suitable for all kernels.

But 8x8 gives 64 threads per block, which is much lower than the 512 threads per block allowed by my hardware. So I’m wondering whether my programming strategy is correct. If not, what is the best approach when sizing the block and the grid of a CUDA kernel? Furthermore, is avoiding branches by padding data structures the proper way to manage variable input dimensions?
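For completeness, the alternative I am trying to avoid is the usual guard branch, something like this (the body is just a placeholder):

__global__ void scaleKernelGuarded(float *data, int rows, int cols)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < rows && col < cols)        // guard branch for the ragged edge
        data[row * cols + col] *= 2.0f;  // placeholder computation
}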

Any comments will be really appreciated.

Many thanks,

Pietro

Occupancy is the key here. If your kernel is completely compute limited, then maximizing the number of threads per block is probably the right way to go. Most kernels, however, are memory bandwidth limited, in which case maximizing occupancy is probably a better strategy. The more active blocks you can have on a multiprocessor, the more likely it is that a warp of threads has its instructions and data ready to run at any given moment, and the better chance your kernel has of hiding memory and instruction pipeline latency. Reducing the number of threads per block can increase occupancy when per-block resource usage (registers or shared memory) is what limits the number of resident blocks per multiprocessor. The occupancy calculator spreadsheet in the SDK is a useful tool for understanding the effect of block size on occupancy.
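If you want to see the effect directly, a quick sweep over candidate block sizes with event timing usually settles it for a given kernel. A rough sketch (dummyKernel, N and the candidate list are just stand-ins for your own kernel and sizes):

#include <cuda_runtime.h>
#include <cstdio>

__global__ void dummyKernel(float *data, int n)   // stand-in for your kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

int main()
{
    const int N = 1 << 22;
    float *d_data;
    cudaMalloc((void **)&d_data, N * sizeof(float));

    int candidates[] = {64, 128, 192, 256, 384, 512};
    float bestMs = 1e30f;
    int bestBlock = 0;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int i = 0; i < 6; ++i) {
        int block = candidates[i];
        int grid = (N + block - 1) / block;       // ceil(N / block)

        dummyKernel<<<grid, block>>>(d_data, N);  // untimed warm-up launch

        cudaEventRecord(start);
        dummyKernel<<<grid, block>>>(d_data, N);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms;
        cudaEventElapsedTime(&ms, start, stop);
        printf("block %3d: %.3f ms\n", block, ms);
        if (ms < bestMs) { bestMs = ms; bestBlock = block; }
    }

    printf("best block size: %d\n", bestBlock);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}

For a 2D kernel like yours you can sweep dim3 shapes the same way; an 8x8 winner is not unusual when fewer threads per block allows more blocks to be resident at once.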