Shared Memory Access - Matrix Multiplication

I am new to CUDA and have started working through the book “Programming Massively Parallel Processors”. In its discussion of global memory bandwidth, the book uses shared memory in matrix multiplication to reduce the number of global memory transactions, and explains this with the concept of tiles. It took me some time to understand the loop structure of the code and how the elements of the matrices are fetched from global memory.

After reading the text I had this question: why do we need tiles (2D blocks), which complicate things? Could we instead use 1D blocks and store the elements in shared memory? The code would be simpler that way. I have not measured the performance of this approach; I just want to know whether I have misunderstood something or whether there is further reasoning that I have missed.


The way the shared memory is allocated, a tile is essentially a 1D block of memory that is accessed with 2 indices.
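
To illustrate the point about a 1D block accessed with 2 dimensions, here is a minimal sketch. It assumes a tile width of 16 (`TILE`) and a single thread block. The kernel and helper names are made up for this example and are not from the book. Each thread writes through the 2D view and then reads a different element back through a flat pointer into the same storage.

```cuda
#include <cassert>
#include <cuda_runtime.h>

#define TILE 16  // assumed tile width; the block is TILE x TILE threads

// Writes through the 2D view, reads back through a flat 1D view of the
// same shared memory, and counts any element where the two views disagree.
__global__ void flatVsTiled(int *mismatches)
{
    __shared__ float tile2d[TILE][TILE];
    float *flat = &tile2d[0][0];           // same storage, viewed as 1D

    int tx = threadIdx.x, ty = threadIdx.y;
    tile2d[ty][tx] = (float)(ty * 100 + tx);
    __syncthreads();                        // all writes visible to the block

    // Read element (row = tx, col = ty) with row-major 1D indexing.
    float viaFlat  = flat[tx * TILE + ty];
    float expected = (float)(tx * 100 + ty);
    if (viaFlat != expected) atomicAdd(mismatches, 1);
}

// Hypothetical host helper: launches one block and returns the mismatch count
// (it stays at -1 if the result could not be copied back).
int countFlatViewMismatches()
{
    int *dCount = nullptr;
    int hCount = -1;
    cudaMalloc(&dCount, sizeof(int));
    cudaMemset(dCount, 0, sizeof(int));
    flatVsTiled<<<1, dim3(TILE, TILE)>>>(dCount);
    cudaMemcpy(&hCount, dCount, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dCount);
    return hCount;
}
```

On a working device `countFlatViewMismatches()` should return 0, since `tile2d[r][c]` and `flat[r * TILE + c]` name the same word. So a "1D" and a "2D" tile differ only in how the index is written, not in how the memory is laid out.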

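I would think that using 2 dimensions makes access easier.
Matrix multiplication reads one operand along its rows and the other along its columns, so one of the two accesses is effectively transposed.
Depending on how the tile is read, a ‘raw’ store of the data in shared memory can result in significant shared memory bank conflicts in the shared memory transactions that follow as part of the operation.
This is countered by adding an offset, typically by padding each row of the tile with one extra element.
A 2D layout then makes the access easier to write.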
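
As a rough sketch of how the tiles, the loop structure and the offset fit together, here is a tiled kernel. It assumes square row-major N x N matrices and a tile width of 16. The names `matMulTiled` and `maxErrorTiled` are made up for this example, and the code is not taken from the book. Each element loaded into a tile is reused by `TILE` threads, which is where the saving in global memory transactions comes from. The `+ 1` on the tile rows is my reading of the offset mentioned above. With this particular read pattern (`As[ty][k]`, `Bs[k][tx]`) it is probably not required, and it matters mainly when a tile is read column-wise.

```cuda
#include <cassert>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

#define TILE 16  // assumed tile width; the block is TILE x TILE threads

// C = A * B for square, row-major N x N matrices (N need not divide by TILE).
__global__ void matMulTiled(const float *A, const float *B, float *C, int N)
{
    // "+ 1" pads each row so consecutive rows start in different banks.
    __shared__ float As[TILE][TILE + 1];
    __shared__ float Bs[TILE][TILE + 1];

    int tx = threadIdx.x, ty = threadIdx.y;
    int row = blockIdx.y * TILE + ty;   // row of C this thread computes
    int col = blockIdx.x * TILE + tx;   // column of C this thread computes
    float acc = 0.0f;

    int numTiles = (N + TILE - 1) / TILE;
    for (int t = 0; t < numTiles; ++t) {
        // Each thread loads one element of the A tile and one of the B tile;
        // out-of-range elements are replaced by zero.
        int aCol = t * TILE + tx;
        int bRow = t * TILE + ty;
        As[ty][tx] = (row < N && aCol < N) ? A[row * N + aCol] : 0.0f;
        Bs[ty][tx] = (bRow < N && col < N) ? B[bRow * N + col] : 0.0f;
        __syncthreads();                // tiles fully loaded

        for (int k = 0; k < TILE; ++k)
            acc += As[ty][k] * Bs[k][tx];
        __syncthreads();                // done reading before the next load
    }
    if (row < N && col < N) C[row * N + col] = acc;
}

// Hypothetical host helper: runs the kernel on small integer-valued inputs
// and returns the largest absolute difference from a CPU reference.
float maxErrorTiled(int N)
{
    std::vector<float> hA(N * N), hB(N * N), hC(N * N, -1.0f);
    for (int i = 0; i < N * N; ++i) {
        hA[i] = (float)(i % 7);
        hB[i] = (float)(i % 5);
    }
    size_t bytes = (size_t)N * N * sizeof(float);
    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((N + TILE - 1) / TILE, (N + TILE - 1) / TILE);
    matMulTiled<<<grid, block>>>(dA, dB, dC, N);
    cudaMemcpy(hC.data(), dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);

    float maxErr = 0.0f;
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) {
            float ref = 0.0f;
            for (int k = 0; k < N; ++k) ref += hA[i * N + k] * hB[k * N + j];
            float err = std::fabs(ref - hC[i * N + j]);
            if (err > maxErr) maxErr = err;
        }
    return maxErr;
}
```

A call such as `maxErrorTiled(50)` exercises the boundary handling, since 50 is not a multiple of 16. A 1D block could stage data in shared memory in the same way, but the row and column indices would then have to be derived by hand from a single thread index.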