Do the tile dimensions and block dimensions have to be the same for shared memory matrix multiplication?

It is generally easier (I did not necessarily say better) to have tile dimensions == block dimensions, I would think.

Running with your thought: if the tile dimension != block dimension, then it would be block dimension < tile dimension; I cannot picture a case of tile dimension < block dimension.

If the block dimension < tile dimension, you would then have to steer the block over the tile, likely through iteration, so that each thread handles more than one element of the tile.
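To make the "steer the block over the tile" idea concrete, here is a rough sketch of a kernel where a 16x16 block covers a 32x32 tile using strided loops. The sizes `TILE` and `BLOCK`, the kernel name `matmulSmallBlock`, and the host helper `countMismatchesSmallBlock` are all made up for illustration, and the matrices are assumed to be square and row-major. Treat it as one possible way to do it, not a tuned implementation.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

constexpr int TILE  = 32;  // tile edge (assumed value)
constexpr int BLOCK = 16;  // block edge, deliberately smaller than TILE
static_assert(TILE % BLOCK == 0, "TILE must be a multiple of BLOCK");

// C = A * B for square n x n row-major matrices.
// A BLOCK x BLOCK thread block produces one TILE x TILE tile of C.
__global__ void matmulSmallBlock(const float* A, const float* B, float* C, int n)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];
    constexpr int STEPS = TILE / BLOCK;   // tile positions per thread, per axis
    float acc[STEPS][STEPS] = {};
    const int row0 = blockIdx.y * TILE;   // first C row of this tile
    const int col0 = blockIdx.x * TILE;   // first C column of this tile

    for (int k0 = 0; k0 < n; k0 += TILE) {
        // Steer the block over the tile: strided loops cover all TILE x TILE slots.
        for (int i = threadIdx.y; i < TILE; i += BLOCK)
            for (int j = threadIdx.x; j < TILE; j += BLOCK) {
                const int aRow = row0 + i, aCol = k0 + j;
                const int bRow = k0 + i,  bCol = col0 + j;
                As[i][j] = (aRow < n && aCol < n) ? A[aRow * n + aCol] : 0.0f;
                Bs[i][j] = (bRow < n && bCol < n) ? B[bRow * n + bCol] : 0.0f;
            }
        __syncthreads();  // tiles fully loaded before any thread reads them

        for (int si = 0; si < STEPS; ++si)
            for (int sj = 0; sj < STEPS; ++sj) {
                const int i = threadIdx.y + si * BLOCK;
                const int j = threadIdx.x + sj * BLOCK;
                for (int k = 0; k < TILE; ++k)
                    acc[si][sj] += As[i][k] * Bs[k][j];
            }
        __syncthreads();  // finish reading before the next load overwrites the tiles
    }

    for (int si = 0; si < STEPS; ++si)
        for (int sj = 0; sj < STEPS; ++sj) {
            const int r = row0 + threadIdx.y + si * BLOCK;
            const int c = col0 + threadIdx.x + sj * BLOCK;
            if (r < n && c < n) C[r * n + c] = acc[si][sj];
        }
}

// Hypothetical host helper: runs the kernel and counts entries that differ
// from a CPU reference. CUDA error checking is omitted for brevity.
int countMismatchesSmallBlock(int n)
{
    assert(n > 0);
    std::vector<float> hA(n * n), hB(n * n), hC(n * n, -1.0f);
    for (int i = 0; i < n * n; ++i) {   // small integer values keep float sums exact for modest n
        hA[i] = float(i % 7 - 3);
        hB[i] = float(i % 5 - 2);
    }
    const size_t bytes = size_t(n) * n * sizeof(float);
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);

    const dim3 block(BLOCK, BLOCK);
    const dim3 grid((n + TILE - 1) / TILE, (n + TILE - 1) / TILE);  // one block per tile
    matmulSmallBlock<<<grid, block>>>(dA, dB, dC, n);
    cudaMemcpy(hC.data(), dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);

    int bad = 0;
    for (int r = 0; r < n; ++r)
        for (int c = 0; c < n; ++c) {
            float ref = 0.0f;
            for (int k = 0; k < n; ++k) ref += hA[r * n + k] * hB[k * n + c];
            if (hC[r * n + c] != ref) ++bad;
        }
    return bad;
}
```

Note that the grid is sized by `TILE`, not by `BLOCK`: there is one block per output tile, and each thread ends up owning `STEPS * STEPS` results. That extra bookkeeping is why tile == block tends to be easier to follow.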

Why do you ask?

I am having a hard time understanding shared memory (tiled) matrix multiplication.

There is likely a matrix transpose buried in there.

Thus, if you first understand how to do a matrix transpose, and why it is done the way it is done, the matrix multiplication should be easier to follow, I would think.
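For reference, a shared-memory transpose can be sketched roughly as below. The part that carries over to tiled multiplication is the pattern: the block loads a tile cooperatively, synchronizes, and then each thread reads an element that a different thread wrote. The tile size, the kernel name `transposeTiled`, and the host helper `countMismatchesTranspose` are illustrative assumptions, not taken from any particular sample.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

constexpr int TILE = 32;  // tile edge == block edge here (assumed value)

// out (cols x rows) = transpose of in (rows x cols), both row-major.
__global__ void transposeTiled(const float* in, float* out, int rows, int cols)
{
    // The extra column of padding is a common trick to avoid shared memory bank conflicts.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;  // column in the input
    int y = blockIdx.y * TILE + threadIdx.y;  // row in the input
    if (x < cols && y < rows)
        tile[threadIdx.y][threadIdx.x] = in[y * cols + x];   // row-wise read
    __syncthreads();  // the whole tile must be in shared memory before it is read back

    // Swap the block indices so the write to global memory is also row-wise.
    x = blockIdx.y * TILE + threadIdx.x;      // column in the output
    y = blockIdx.x * TILE + threadIdx.y;      // row in the output
    if (x < rows && y < cols)
        out[y * rows + x] = tile[threadIdx.x][threadIdx.y];  // element loaded by another thread
}

// Hypothetical host helper: runs the kernel and counts wrong entries.
// CUDA error checking is omitted for brevity.
int countMismatchesTranspose(int rows, int cols)
{
    assert(rows > 0 && cols > 0);
    std::vector<float> hIn(rows * cols), hOut(rows * cols, -1.0f);
    for (int i = 0; i < rows * cols; ++i) hIn[i] = float(i);

    const size_t bytes = size_t(rows) * cols * sizeof(float);
    float *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, bytes); cudaMalloc(&dOut, bytes);
    cudaMemcpy(dIn, hIn.data(), bytes, cudaMemcpyHostToDevice);

    const dim3 block(TILE, TILE);
    const dim3 grid((cols + TILE - 1) / TILE, (rows + TILE - 1) / TILE);
    transposeTiled<<<grid, block>>>(dIn, dOut, rows, cols);
    cudaMemcpy(hOut.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dIn); cudaFree(dOut);

    int bad = 0;
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c)
            if (hOut[c * rows + r] != hIn[r * cols + c]) ++bad;
    return bad;
}
```

The reason for going through shared memory at all is that both the global read and the global write can then walk along rows, which keeps them coalesced. The "transposed" access happens only inside the tile.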

There’s a writeup in the programming guide that covers matrix multiplication using shared memory that may be of interest:

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#shared-memory
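For orientation, a minimal sketch of the usual tile == block variant looks something like the following. This is not the programming guide's code; the name `matmulTiled`, the 16x16 tile, the square row-major matrices, and the host helper `countMismatchesTiled` are assumptions chosen to keep the example short.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

constexpr int TILE = 16;  // tile edge == block edge (assumed value)

// C = A * B for square n x n row-major matrices; one thread per element of C.
__global__ void matmulTiled(const float* A, const float* B, float* C, int n)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    const int row = blockIdx.y * TILE + threadIdx.y;
    const int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int k0 = 0; k0 < n; k0 += TILE) {
        // Each thread loads exactly one element of each tile (zero outside the matrix).
        const int aCol = k0 + threadIdx.x;
        const int bRow = k0 + threadIdx.y;
        As[threadIdx.y][threadIdx.x] = (row < n && aCol < n) ? A[row * n + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (bRow < n && col < n) ? B[bRow * n + col] : 0.0f;
        __syncthreads();  // both tiles complete before the partial dot product

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // finish reading before the next iteration overwrites the tiles
    }
    if (row < n && col < n) C[row * n + col] = acc;
}

// Hypothetical host helper: runs the kernel and counts entries that differ
// from a CPU reference. CUDA error checking is omitted for brevity.
int countMismatchesTiled(int n)
{
    assert(n > 0);
    std::vector<float> hA(n * n), hB(n * n), hC(n * n, -1.0f);
    for (int i = 0; i < n * n; ++i) {   // small integer values keep float sums exact for modest n
        hA[i] = float(i % 7 - 3);
        hB[i] = float(i % 5 - 2);
    }
    const size_t bytes = size_t(n) * n * sizeof(float);
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);

    const dim3 block(TILE, TILE);
    const dim3 grid((n + TILE - 1) / TILE, (n + TILE - 1) / TILE);
    matmulTiled<<<grid, block>>>(dA, dB, dC, n);
    cudaMemcpy(hC.data(), dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);

    int bad = 0;
    for (int r = 0; r < n; ++r)
        for (int c = 0; c < n; ++c) {
            float ref = 0.0f;
            for (int k = 0; k < n; ++k) ref += hA[r * n + k] * hB[k * n + c];
            if (hC[r * n + c] != ref) ++bad;
        }
    return bad;
}
```

With tile == block, each element of `A` and `B` is read from global memory once per block and then reused `TILE` times from shared memory, and the indexing stays simple because every thread loads one element per tile and owns one element of `C`.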

Thank You!!!