I am new to CUDA and have begun with the book “Programming Massively Parallel Processors”. While discussing global memory bandwidth, the book explains how to use shared memory for matrix multiplication to reduce the number of global memory transactions, using the concept of tiles. However, after going through the code, it took me some time to understand the loop structure and how the elements of the matrices are accessed from global memory.
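For context, the tiled kernel I am referring to looks roughly like this. This is my own reconstruction of the book's approach, not a copy of its listing; `TILE_WIDTH` is a compile-time constant and I assume `Width` is a multiple of it:

```cuda
#define TILE_WIDTH 16

// Tiled matrix multiplication C = A * B for square Width x Width matrices.
// Each 2D block computes one TILE_WIDTH x TILE_WIDTH tile of C.
__global__ void MatMulTiled(const float *A, const float *B, float *C, int Width)
{
    __shared__ float As[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Bs[TILE_WIDTH][TILE_WIDTH];

    int row = blockIdx.y * TILE_WIDTH + threadIdx.y;
    int col = blockIdx.x * TILE_WIDTH + threadIdx.x;

    float acc = 0.0f;
    // March a pair of tiles across A's row strip and down B's column strip.
    for (int t = 0; t < Width / TILE_WIDTH; ++t) {
        // Each thread loads exactly one element of each tile into shared memory.
        As[threadIdx.y][threadIdx.x] = A[row * Width + t * TILE_WIDTH + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE_WIDTH + threadIdx.y) * Width + col];
        __syncthreads();

        // Partial dot product using only shared memory reads.
        for (int k = 0; k < TILE_WIDTH; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * Width + col] = acc;
}
```

The part that took me time was seeing that each global element of A and B is loaded once per tile phase but then reused `TILE_WIDTH` times from shared memory.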
After going through the text, one question came to mind: why do we need tiles (2D blocks) and the complication they bring? We could instead use 1D blocks and still stage the elements in shared memory, and the code would also be simpler. I have not benchmarked this alternative, so I would like to know whether my understanding has gone wrong, or whether there is some further logic I have missed.
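Concretely, the simpler 1D scheme I have in mind is something like the following untested sketch, where each 1D block computes one row of C and stages only that row of A in shared memory (kernel name and `BLOCK_SIZE` are my own, and the dynamic shared memory size `Width * sizeof(float)` must be passed at launch):

```cuda
#define BLOCK_SIZE 256

// One 1D block per row of C: the row of A is read from global memory once
// and kept in shared memory, but every element of B is still fetched
// straight from global memory with no reuse.
__global__ void MatMulRowShared(const float *A, const float *B, float *C, int Width)
{
    extern __shared__ float Arow[];   // Width floats, sized at kernel launch

    int row = blockIdx.x;

    // Cooperatively copy this row of A into shared memory.
    for (int j = threadIdx.x; j < Width; j += blockDim.x)
        Arow[j] = A[row * Width + j];
    __syncthreads();

    // Each thread produces one or more elements of the output row.
    for (int col = threadIdx.x; col < Width; col += blockDim.x) {
        float acc = 0.0f;
        for (int k = 0; k < Width; ++k)
            acc += Arow[k] * B[k * Width + col];
        C[row * Width + col] = acc;
    }
}
```

Is the problem with this scheme simply that B gets no shared-memory reuse, or is there more to it?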