I have a very naive question. Does the arrangement of threads within a block affect performance? For example, suppose I want to launch a kernel with 64 threads. I could declare the block as either:
dim3 dimBlock (64,1) or dim3 dimBlock (8,8).
Will there be a difference in performance between these two cases? Or is the 2D layout just a convenient way to manage threads when the application itself works on something 2D? If yes, then why?
I have the same question about the arrangement of blocks in a grid.
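To make the comparison concrete, here is a minimal sketch of the two launches I have in mind. The kernel name fillIndex, its body, and the single-block grid are placeholders I made up for illustration, not my real code; the only thing that differs between the two launches is the shape of the block.

```cuda
#include <cuda_runtime.h>

// Hypothetical placeholder kernel: each thread writes its flattened
// index within the block, so both block shapes touch the same 64 slots.
__global__ void fillIndex(int *out)
{
    int tid = threadIdx.y * blockDim.x + threadIdx.x;
    out[tid] = tid;
}

int main()
{
    int *d_out = nullptr;
    cudaMalloc(&d_out, 64 * sizeof(int));

    dim3 dimGrid(1, 1);        // one block, assumed only for this sketch
    dim3 blockLinear(64, 1);   // 64 threads laid out as 64 x 1
    dim3 blockSquare(8, 8);    // the same 64 threads laid out as 8 x 8

    fillIndex<<<dimGrid, blockLinear>>>(d_out);
    fillIndex<<<dimGrid, blockSquare>>>(d_out);
    cudaDeviceSynchronize();   // wait for both launches to finish

    cudaFree(d_out);
    return 0;
}
```

Both launches run 64 threads per block and write the same 64 elements; what I am asking is whether the hardware treats them any differently.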