It might be a stupid question, but I still want to ask: does the block dimension influence performance?
For example, I want to compute the colors for all vertices in a scene. I put the information (positions, normals…) of all vertices in a 2D texture, then assign a single CUDA thread to each vertex. I need to put a certain number of threads in one block, and this block could be 1D, 2D or 3D. The number of blocks I need is simply (number of all vertices) / (number of threads in one block), rounded up. I am wondering whether the block dimension will influence the performance.
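A small note on the block count: plain integer division truncates, so the usual trick is to round up and let the extra threads in the last block do nothing. A minimal sketch in plain C++ (the helper name blocksNeeded is made up for illustration, and the kernel would still need a bounds check such as "if (i < numVertices)"):

```cpp
#include <cassert>

// Hypothetical helper: number of blocks needed so that every vertex
// gets one thread, rounding up when the count does not divide evenly.
unsigned int blocksNeeded(unsigned int numVertices, unsigned int threadsPerBlock)
{
    assert(threadsPerBlock > 0);
    return (numVertices + threadsPerBlock - 1) / threadsPerBlock;
}
```

With 1000 vertices and 256 threads per block this gives 4 blocks, so 24 threads in the last block have no vertex to work on.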
Well, ideally, you want your thread count to be as high as allowable by your device (determined by the number of registers available / number of registers your kernel uses). Typically though, the thread count per block is set to 128 or 256 so that a greater level of compatibility is possible across different device types.
So, basically, yes, the number of threads in your block has an effect on the performance of your program.
I guess the question was whether the dimensionality of the block, for a given number of threads per block, affects the overall performance. My opinion is yes, as the dimensionality facilitates or encumbers the memory addressing calculation.
I would think that the answer is no, since the compiler translates any ‘2D’ and ‘3D’ calls in the code to the relevant 1D call (since that is how the memory is laid out). The only performance hit might be if you are determining your block size dynamically, where the compiler would have to add extra instructions into the code to handle the dynamic block size.
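For reference, the linearization usually quoted from the CUDA programming guide is x + y*Dx + z*Dx*Dy for thread (x, y, z) in a block of size (Dx, Dy, Dz), and warps are formed from consecutive linear IDs. A sketch of that arithmetic in plain C++ (the function names are made up, and 32 is assumed as the warp size):

```cpp
#include <cassert>

// Linear ID of thread (x, y, z) inside a block whose first two
// dimensions are dimX and dimY. Mirrors the formula
// x + y*Dx + z*Dx*Dy from the CUDA programming guide.
unsigned int linearThreadId(unsigned int x, unsigned int y, unsigned int z,
                            unsigned int dimX, unsigned int dimY)
{
    assert(x < dimX && y < dimY);
    return x + y * dimX + z * dimX * dimY;
}

// Warp that a given linear thread ID falls into (assumed warp size: 32).
unsigned int warpOfLinearId(unsigned int linearId)
{
    return linearId / 32;
}
```

So in a 16x16 block, thread (3, 2) has linear ID 35 and sits in warp 1, exactly where thread 35 of a 256x1 block would sit.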
Not always. For example, on pre-G200 architectures each SM can process 768 threads simultaneously and the maximum number of threads in a block is 512. Since a block cannot be split among SMs (I believe because of synchronization), with 512-thread blocks you will never run more than one block per SM, therefore using only 512 of the 768 threads. This may be costing you performance (but won’t in all cases).
So your block dimensions should be such that the number of threads per block is a divisor of the number of threads an SM can process. Block size can also affect performance in cases where threads extensively share data (shared memory), but I can’t think of a problem where more wouldn’t equal better in that case. In fact, I can imagine instances where you would want 512 threads per block in such situations (contrary to what I wrote above about the divisor).
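To put numbers on the divisor argument, here is a rough sketch in plain C++. It only looks at the per-SM thread limit (768 in the example above) and deliberately ignores the other limits that also cap resident blocks, such as registers, shared memory and the maximum number of blocks per SM. The function name is made up for illustration.

```cpp
#include <cassert>

// Threads that can be resident on one SM when whole blocks of
// threadsPerBlock threads are packed under smThreadLimit.
// Simplification: register, shared-memory and blocks-per-SM limits
// are not modelled here.
unsigned int residentThreads(unsigned int smThreadLimit, unsigned int threadsPerBlock)
{
    assert(threadsPerBlock > 0);
    unsigned int blocks = smThreadLimit / threadsPerBlock;  // whole blocks only
    return blocks * threadsPerBlock;
}
```

With a 768-thread limit, 512-thread blocks leave 512 threads resident (one block), while 256 or 384 threads per block fill all 768.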
You should avoid small blocks. Smaller than 32 threads and you’re getting below warp size - bad! 64 threads per block is the minimum (and good only if you can execute many concurrent blocks, i.e. good occupancy). 128-256 threads per block tends to be a comfy zone for many applications and that’s where I suggest you start, but in general it’s not trivial to optimize block dimensions. There’s usually experimentation involved (sometimes a lot), so you should try different sizes. Remember to try to keep them multiples of 32 if you can.
The “spatial” configuration (1D vs 2D vs 3D) is mostly for the programmer’s convenience and not that critical, though more efficient addressing can probably save you a few registers. Maybe it also affects texture caching? I don’t know. In most situations you’ll probably want to map the dimensionality to the problem. If you’re doing computations on a 2D array of texels, the obvious solution of 2D blocks seems the most appropriate.
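The multiple-of-32 advice comes down to simple arithmetic: a block is executed in whole warps, so any remainder leaves lanes idle in the last warp. A quick sketch in plain C++, assuming a warp size of 32 (the function names are made up):

```cpp
#include <cassert>

const unsigned int kWarpSize = 32;  // assumed warp size

// Number of warps a block occupies; a partial warp still counts as one.
unsigned int warpsPerBlock(unsigned int threadsPerBlock)
{
    assert(threadsPerBlock > 0);
    return (threadsPerBlock + kWarpSize - 1) / kWarpSize;
}

// Lanes in the last warp that have no thread assigned to them.
unsigned int idleLanes(unsigned int threadsPerBlock)
{
    return warpsPerBlock(threadsPerBlock) * kWarpSize - threadsPerBlock;
}
```

For instance, a 48-thread block takes 2 warps with 16 idle lanes, and a 20-thread block takes 1 warp with 12 idle lanes, while 256 threads fill 8 warps exactly.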
EDIT:
That’s how it works for linear global memory; textures are different.
I don’t think it depends only on whether the block size is dynamic. An example from image processing: if each block handles a sub-image, a 2D block gives a direct blockbase + (threadIdx.x + width*threadIdx.y) address, while a 1D block needs a conversion from threadIdx.x to a 2D within-sub-image address: blockbase + (threadIdx.x / sub_width)*width + threadIdx.x % sub_width. That is simply because sub_width is implied by the block shape (blockDim.x) in the former case and has to be supplied explicitly in the latter.
In other words, a block dimensionality that fits the underlying data model gives a more efficient addressing calculation.
On its own, there won’t be a performance difference. BUT… having 1D blocks often makes it easier for me to think about things like coalescing and divergence. That’s something to keep in mind.
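The two addressing schemes can be compared side by side in plain C++. This is only a sketch of the arithmetic: tx, ty and t stand in for threadIdx.x, threadIdx.y and the 1D threadIdx.x, and the function names are made up. Both produce the same address; the 1D version just pays for an extra divide and modulo.

```cpp
#include <cassert>

// 2D block: the thread's (tx, ty) coordinates are used directly.
unsigned int addr2D(unsigned int blockBase, unsigned int width,
                    unsigned int tx, unsigned int ty)
{
    return blockBase + (tx + width * ty);
}

// 1D block: the flat thread index t must first be split into a
// row (t / subWidth) and a column (t % subWidth) of the sub-image.
unsigned int addr1D(unsigned int blockBase, unsigned int width,
                    unsigned int subWidth, unsigned int t)
{
    assert(subWidth > 0);
    return blockBase + (t / subWidth) * width + t % subWidth;
}
```

For a 640-wide image and a 16-wide sub-image, flat thread 17 corresponds to (tx, ty) = (1, 1), and both functions return blockBase + 641.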
P.S. NO, you don’t need as many threads as possible! Sometimes it helps to have max threads. Usually you only need 256. Frequently you’re better off using registers to accelerate your program and don’t mind having few threads.