Grid dimensions are not important for performance; only the resulting number of blocks is. The grid is just an abstraction to make your life easier, so that you need fewer divmod operations in your kernel to determine where in your data set you should operate.
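To make the divmod point concrete, here is a rough host-side C++ sketch of the index arithmetic, not actual kernel code. The `BlockIdx` struct is a hypothetical stand-in for CUDA's built-in `blockIdx`, and the function names and the `tiles_per_row` parameter are made up for illustration. It assumes the data set is split into a row-major grid of tiles, one tile per block.

```cpp
#include <cassert>

// Hypothetical stand-in for CUDA's built-in blockIdx (x = column, y = row).
struct BlockIdx { unsigned x, y; };

// 1D grid: the kernel only sees a linear block id, so it has to recover
// the tile's column and row itself with a modulo and a division.
BlockIdx tile_from_linear(unsigned linear_block, unsigned tiles_per_row) {
    assert(tiles_per_row > 0);
    return { linear_block % tiles_per_row, linear_block / tiles_per_row };
}

// 2D grid: the pair arrives ready-made, no divmod needed in the kernel.
BlockIdx tile_from_2d(unsigned bx, unsigned by) {
    return { bx, by };
}

// Checks that both launch shapes address exactly the same set of tiles.
bool mappings_agree(unsigned tiles_x, unsigned tiles_y) {
    for (unsigned by = 0; by < tiles_y; ++by) {
        for (unsigned bx = 0; bx < tiles_x; ++bx) {
            BlockIdx a = tile_from_linear(by * tiles_x + bx, tiles_x);
            BlockIdx b = tile_from_2d(bx, by);
            if (a.x != b.x || a.y != b.y) return false;
        }
    }
    return true;
}
```

Under these assumptions a 15x1 grid and a 5x3 grid launch the same 15 blocks over the same tiles; the 2D form just spares the kernel the `%` and `/`.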
I would like to say that is true, but in my experience there are cases where a tall, narrow block is faster than a short, wide one. I don’t know exactly what causes this, but I’m sure it is related to the banking of device memory.
Indeed, that can certainly hold for block sizes, especially if the block size changes which memory locations are accessed. But the OP was asking about grid sizes, in which case it all depends on what the program does with blockIdx.