Shared Memory Access - Matrix Multiplication

I am new to CUDA and have begun with the book “Programming Massively Parallel Processors”. While talking about global memory bandwidth, the book discusses using shared memory for matrix multiplication to reduce the number of global memory transactions. This is explained with the concept of tiles. However, after going through the code, it took me some time to understand the loop structure and how the elements are accessed from the matrices in global memory.

After going through the text, I had this question in mind: why do we need to use tiles (2D blocks) and make things complicated? Couldn't we instead use 1D blocks and store the elements in shared memory? That way the code would also be simpler. I have not measured the performance of such an approach, but I want to know whether my understanding is wrong or whether there is some additional logic that I have not been able to get.
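For concreteness, this is roughly the kind of tiled kernel the book walks through (my own sketch rather than the book's exact listing; it assumes square Width x Width matrices with Width a multiple of TILE_WIDTH):

```
#define TILE_WIDTH 16

// Tiled matrix multiplication: C = A * B for square Width x Width matrices.
// Simplified case: Width is assumed to be a multiple of TILE_WIDTH.
__global__ void MatMulTiled(const float* A, const float* B, float* C, int Width)
{
    __shared__ float As[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Bs[TILE_WIDTH][TILE_WIDTH];

    int row = blockIdx.y * TILE_WIDTH + threadIdx.y;
    int col = blockIdx.x * TILE_WIDTH + threadIdx.x;

    float acc = 0.0f;

    // Walk over the tiles of A and B needed to compute C[row][col].
    for (int t = 0; t < Width / TILE_WIDTH; ++t) {
        // Each thread loads one element of the A tile and one of the B tile.
        As[threadIdx.y][threadIdx.x] = A[row * Width + (t * TILE_WIDTH + threadIdx.x)];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE_WIDTH + threadIdx.y) * Width + col];
        __syncthreads();

        // Partial dot product using the tiles now held in shared memory.
        for (int k = 0; k < TILE_WIDTH; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }

    C[row * Width + col] = acc;
}
```

As I understand it, the point is that each element of A and B is fetched from global memory only Width/TILE_WIDTH times instead of Width times, which is the bandwidth reduction the book is after.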

[https://developer.nvidia.com/content/programming-massively-parallel-processors-hands-approach]

The way the shared memory is assigned, it is essentially a 1D block that is simply accessed with two indices, so I would think that using two dimensions mainly makes the access easier to write and read.

Also, matrix multiplication involves a transpose of sorts: one operand is read along rows and the other along columns. A 'raw' store of the data in shared memory would result in significant shared memory bank conflicts, given the shared memory transactions that follow as part of the operation. This is countered by adding an offset (padding), and 2D indexing then makes that access easier to express; see the sketch below.
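To illustrate the offset/padding idea, here is a minimal sketch of my own (not from the book, and shown on a simple tile-based transpose rather than the full multiplication; the kernel name, TILE_DIM, and the assumption that width is a multiple of TILE_DIM are all mine):

```
#define TILE_DIM 32

// Without the "+ 1", the column-wise reads of `tile` below would have every
// thread in a warp access addresses that are a 32-float stride apart, i.e.
// all in the same shared-memory bank. The extra padding column shifts each
// row into a different bank, so the column-wise reads become conflict-free.
__global__ void transposePadded(const float* in, float* out, int width)
{
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];   // +1 column = the offset/padding

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;

    // Row-wise (coalesced) load from global memory into the tile.
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];
    __syncthreads();

    // Transposed block coordinates for the write-out.
    x = blockIdx.y * TILE_DIM + threadIdx.x;
    y = blockIdx.x * TILE_DIM + threadIdx.y;

    // Column-wise read from the tile; conflict-free only because of the padding.
    out[y * width + x] = tile[threadIdx.x][threadIdx.y];
}
```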