I have a very naive question. Does the arrangement of threads within a block affect performance? What I mean is: suppose I want to launch a kernel with 64 threads, then:
dim3 dimBlock(64,1) or dim3 dimBlock(8,8).
Will there be a difference in performance in these cases? Or is the 2D shape just a convenient way of managing threads when your application is also working with something 2D? If yes, then why?
I have the same question about the orientation of blocks in a grid.
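For concreteness, here is a minimal sketch (my own illustration, with a hypothetical kernel name fillIndices) of the two launch configurations I mean; the kernel recovers the same linear index either way:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each of the 64 threads records its linear index,
// computed the same way regardless of how the block is shaped.
__global__ void fillIndices(int *out)
{
    int tid = threadIdx.y * blockDim.x + threadIdx.x;
    out[tid] = tid;
}

int main()
{
    int *d_out;
    cudaMalloc(&d_out, 64 * sizeof(int));

    dim3 blockA(64, 1);   // 64 threads in one dimension
    fillIndices<<<1, blockA>>>(d_out);

    dim3 blockB(8, 8);    // the same 64 threads as an 8x8 square
    fillIndices<<<1, blockB>>>(d_out);

    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}
```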
I always see the orientation of blocks in a grid/threads in a block as just a way of efficiently representing your problem in CUDA terms…
For example, in C it is easier to write a matrix multiplication program using 2D arrays than using a 1D array (which represents the matrix in row-major order, let’s say).
However, do note that it is advisable to keep the number of threads in a block a multiple of the warp size (32).
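To illustrate both points, here is a minimal sketch (not from the original posts; the kernel name matAdd and the size N are my own) of how a 2D block maps naturally onto matrix indexing, using a 16x16 block whose 256 threads are a multiple of the warp size:

```cuda
#include <cuda_runtime.h>

#define N 1024  // assumed square matrix dimension, divisible by 16

// Element-wise matrix add: the 2D block/grid layout makes the row/column
// indexing read almost like the 2D-array version in plain C.
__global__ void matAdd(const float *a, const float *b, float *c)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < N && col < N)
        c[row * N + col] = a[row * N + col] + b[row * N + col];
}

// Launch with a 16x16 block: 256 threads, a multiple of the warp size (32).
// dim3 block(16, 16);
// dim3 grid(N / 16, N / 16);
// matAdd<<<grid, block>>>(d_a, d_b, d_c);
```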
Just make sure you know how warps are laid out in 2D and 3D blocks. In 1D it’s simple: threads with consecutive threadIdx.x values fill each warp, 32 at a time. With more dimensions, you have to know how the several index components translate into a single 1D index; the rule is given in the Programming Guide. Other than that, it’s just for convenience.
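To make the linearization explicit, here is a small sketch of the rule from the Programming Guide (the linear thread ID is x fastest, then y, then z, and consecutive IDs fill warps 32 at a time); the kernel name showWarps and the printing are my own illustration:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread computes its linear ID within the block and the warp that
// linear ID falls into (32 consecutive IDs per warp).
__global__ void showWarps()
{
    int linear = threadIdx.x
               + threadIdx.y * blockDim.x
               + threadIdx.z * blockDim.x * blockDim.y;
    int warp = linear / warpSize;  // warpSize is 32 on current hardware
    if (threadIdx.x == 0)  // print one line per row to keep output short
        printf("(%d,%d,%d) -> linear %d, warp %d\n",
               threadIdx.x, threadIdx.y, threadIdx.z, linear, warp);
}

int main()
{
    dim3 block(8, 8);  // rows y = 0..3 land in warp 0, rows y = 4..7 in warp 1
    showWarps<<<1, block>>>();
    cudaDeviceSynchronize();
    return 0;
}
```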