How to determine the Block Size

Hi all,

I’m wondering how to choose the block size. What strategy should generally be adopted when programming a GPU?

Let us suppose that I want to handle generic input sizes and exploit the computational capabilities of the GPU hardware as fully as possible. Accordingly, in my opinion, I should use the maximum number of threads per block allowed and avoid branches or conditional expressions in kernels, so that truly parallel execution can be achieved.

So if the maximum number of threads per block is 512, maybe a block size of 32x16 or 16x32 would be the right choice. Since I fix the block size and want to avoid if statements in kernels, I have to pad the data structures involved so that their dimensions are integer multiples of the block. In addition, I thought that the larger the data structures, the smaller the relative cost of padding, since in the worst case a structure is padded with 31 extra rows and 15 extra columns (or equivalently 15 rows and 31 columns).
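To be concrete, this is roughly what I do now (scaleKernel and the sizes are just placeholders for my real code):

#include <cuda_runtime.h>

// Placeholder kernel: every thread touches one element with no bounds
// test, because the padded allocation guarantees all indices are valid.
__global__ void scaleKernel(float *data, int pitch)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    data[row * pitch + col] *= 2.0f;
}

void launch(int N, int M)   // N rows, M columns of real data
{
    const int BLOCK_X = 32, BLOCK_Y = 16;   // 32*16 = 512 threads per block

    // Round each dimension up to the next multiple of the block dimension.
    int paddedCols = ((M + BLOCK_X - 1) / BLOCK_X) * BLOCK_X;
    int paddedRows = ((N + BLOCK_Y - 1) / BLOCK_Y) * BLOCK_Y;

    float *d_data;
    cudaMalloc((void **)&d_data, paddedRows * paddedCols * sizeof(float));
    cudaMemset(d_data, 0, paddedRows * paddedCols * sizeof(float)); // zero everything, padding included

    dim3 block(BLOCK_X, BLOCK_Y);
    dim3 grid(paddedCols / BLOCK_X, paddedRows / BLOCK_Y);
    scaleKernel<<<grid, block>>>(d_data, paddedCols);

    cudaFree(d_data);
}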

I adopted this strategy in writing my code, but I found that the best performance is attained with an 8x8 block. I understand that there is a trade-off in the block size choice that also involves memory transfers between host and device, and that the best block size depends on the specific code, so no single value is suitable for all kernels.

But 8x8 gives 64 threads per block, which is much lower than the 512 threads per block allowed by my hardware. So I’m wondering whether my programming strategy is correct. If not, what is the best approach when sizing the block and the grid of a CUDA kernel? Furthermore, is avoiding branches by padding data structures the proper way to manage variable input dimensions?
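For completeness, the alternative I am trying to avoid is the usual guard branch, something like this (the body is just a placeholder):

__global__ void scaleKernelGuarded(float *data, int rows, int cols)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < rows && col < cols)        // guard branch for the ragged edge
        data[row * cols + col] *= 2.0f;  // placeholder computation
}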

Any comments will be really appreciated.

Many thanks,

Pietro

Occupancy is the key here. If your kernel is completely compute limited, then maximizing the number of threads per block is probably the right way to go. Most kernels, however, are memory bandwidth limited, in which case maximizing occupancy is probably a better strategy. The more active blocks you can have on a multiprocessor, the more likely it is that a warp of threads has its instructions and data ready to run at any given moment, and the better chance your kernel has of hiding memory and instruction pipeline latency. Reducing the number of threads per block can increase occupancy when per-block resource usage (registers or shared memory) is what limits the number of resident blocks per multiprocessor. The occupancy calculator spreadsheet in the SDK is a useful tool for understanding the effect of block size on occupancy.
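If you want to see the effect directly, a quick sweep over candidate block sizes with event timing usually settles it for a given kernel. A rough sketch (dummyKernel, N and the candidate list are just stand-ins for your own kernel and sizes):

#include <cuda_runtime.h>
#include <cstdio>

__global__ void dummyKernel(float *data, int n)   // stand-in for your kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

int main()
{
    const int N = 1 << 22;
    float *d_data;
    cudaMalloc((void **)&d_data, N * sizeof(float));

    int candidates[] = {64, 128, 192, 256, 384, 512};
    float bestMs = 1e30f;
    int bestBlock = 0;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int i = 0; i < 6; ++i) {
        int block = candidates[i];
        int grid = (N + block - 1) / block;       // ceil(N / block)

        dummyKernel<<<grid, block>>>(d_data, N);  // untimed warm-up launch

        cudaEventRecord(start);
        dummyKernel<<<grid, block>>>(d_data, N);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms;
        cudaEventElapsedTime(&ms, start, stop);
        printf("block %3d: %.3f ms\n", block, ms);
        if (ms < bestMs) { bestMs = ms; bestBlock = block; }
    }

    printf("best block size: %d\n", bestBlock);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}

For a 2D kernel like yours you can sweep dim3 shapes the same way; an 8x8 winner is not unusual when fewer threads per block allows more blocks to be resident at once.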