I’m wondering how to choose the block size. What strategy should one generally adopt when programming a GPU?
Suppose I want to handle inputs of arbitrary size and exploit the computational capabilities of the GPU hardware as fully as possible. In my opinion, I should then use the maximum number of threads per block allowed and avoid branches or conditional expressions in kernels, so that truly parallel execution can be achieved.
So if the maximum number of threads per block is 512, a block of 32x16 or 16x32 should be the right choice. Since I fix the block size and want to avoid if statements in kernels, I have to pad the data structures involved so that their dimensions are integer multiples of the block size. Moreover, I reasoned that the larger the data structures, the smaller the relative cost of padding: in the worst case a structure is padded with 31 extra rows and 15 extra columns (or, equivalently, 15 rows and 31 columns).
I adopted this strategy in my code, but I found that the best performance is attained with an 8x8 block. I understand that there is a trade-off in the block-size choice, which also involves memory transfers between host and device and vice versa, and that the best block size depends on the specific code, so no single value can be determined that suits all codes.
But 8x8 gives only 64 threads per block, far below the 512 maximum allowed by my hardware. So I’m wondering whether my programming strategy is correct. If not, what is the best approach to sizing the block and the grid of a CUDA kernel? Furthermore, is padding data structures to avoid branches the proper way to handle variable input dimensions?
Any comments about this would be really appreciated.