A question about loading shared memory in matrix multiplication

NUM_PER_THREAD_M = 8
NUM_PER_THREAD_N = 1
SUBK = 32
as[32][32]
bs[32][32]

            for (int j = 0; j < SUBK; ++j)
            {
                for (int m = 0; m < NUM_PER_THREAD_M; ++m)
                {
                    for (int n = 0; n < NUM_PER_THREAD_N; ++n)
                    {
                        // if (i + j < K)
                        {
                            // shifted column index (idXTmp) mentioned in the question; unused in this baseline version
                            int idXTmp = (j + tidY + m * DIVIDE_M) % 32;
                            // if ((idY + m * DIVIDE_M) < M && (idX + n * DIVIDE_N) < N && i + j < K)
                            cTmp[m][n] += as[(tidY + m * DIVIDE_M)][j] * bs[j][(tidX + n * DIVIDE_N)];
                        }
                    }
                }
            }

I am trying to implement matrix multiplication in CUDA. I ran into a strange issue when I attempted to replace the column index j in the as access with idXTmp. In Nsight Compute, both the Instructions and the Requests for the shared memory loads doubled, and the kernel’s performance also decreased significantly, which I find quite puzzling. I’m wondering whether shared memory is also loaded in sections, so that following the order 0 1 2 3 4 5 6 7 8 would only require two loads, while following the order 1 2 3 4 5 6 7 0 would require three. Could this be the reason for the increase in Requests?

PS: When I changed NUM_PER_THREAD_M from 4 to 8, I noticed bank conflicts appearing. I know this approach won’t solve the bank conflict issue, since there should be no bank conflicts when reading as in the first place. Still, I find this accidental discovery quite strange.

Often it helps to make the arrays a bit larger, e.g. as[32][33], so that a change in either coordinate lands in a different bank.
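For illustration, a minimal sketch of that padding (hypothetical kernel name; float tiles of size 32 as in the question are assumed):

    __global__ void padded_tiles_demo()
    {
        // With 33 columns, the bank of element [r][c] is (r * 33 + c) % 32 =
        // (r + c) % 32, so as[0][c], as[1][c], ... fall into 32 different banks
        // and column-wise accesses of a warp no longer collide.
        __shared__ float as[32][33];
        __shared__ float bs[32][33];

        // load / compute / store as before, simply ignoring the extra column
        as[threadIdx.y][threadIdx.x] = 0.0f;   // placeholder accesses
        bs[threadIdx.y][threadIdx.x] = 0.0f;
    }

The padding costs one unused column of shared memory per row, but the index arithmetic stays unchanged.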

Is it true that in your code tidX goes from 0…7 and tidY is always 0? Or how are the 32 threads of a warp arranged? If you take the index into shared memory (converted into 1D, in 4-byte words) modulo 32, all 32 threads of a warp should get a different result for the bank number to avoid bank conflicts.
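As a sketch, that bank calculation could look like the helper below (hypothetical name; 4-byte elements are assumed, and the row length in words has to match the real array):

    // Bank of a 2D shared-memory element: linearize to a 4-byte word index,
    // then take it modulo 32. Across the 32 threads of a warp the result
    // should differ for every thread to avoid bank conflicts.
    __device__ int bank_of(int row, int col, int rowLengthInWords)
    {
        int idx1D = row * rowLengthInWords + col;
        return idx1D % 32;
    }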

Hi, Curefab, tidY is not always 0.

int tidX = threadIdx.x;
int tidY = threadIdx.y;

Here is my code: CUDA-Learning-Journal/cplusplus/05_multiplication/matrix_multiplication.ipynb at main · Qin-sx/CUDA-Learning-Journal · GitHub.
The test function is matrix_multiplication2.
When I set NUM_PER_THREAD_M = 4, there were no bank conflicts.

When I set NUM_PER_THREAD_M = 8, bank conflicts occurred.

When I set NUM_PER_THREAD_M = 8 and padded the arrays to as[BLOCK_SIZE_M][SUBK+1] and bs[SUBK][BLOCK_SIZE_N+1], bank conflicts still occurred and the function time increased a lot.

Could you help me take a look at why this might be happening?
Thank you for your time!

Hi qin_sx,
the shared load bank conflicts could also come from an inexact measurement, as there are only a few of them and bank conflicts are sometimes not counted exactly.
You can switch to the source-code view in Nsight Compute and find at which code lines the bank conflicts happen.

But for finding and solving bank conflicts in general:
For the first 32 threads (blockIdx.x/y/z == 0 && (threadIdx.z * blockDim.y + threadIdx.y) * blockDim.x + threadIdx.x < 32), calculate the linearized array index for 4-byte accesses (for an [N][M][K] array: idx = (n * M + m) * K + k) modulo 32 for each read or write access separately.
You should get all numbers between 0 and 31, in any order. If some numbers appear multiple times, there are bank conflicts.

You could either print directly from device code (printf), store the indices into some output array, simulate the code (or at least the index calculation) on the host, or do it with pen & paper.
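For example, a minimal host-side simulation of that check for the two shared-memory reads in the loop from the question. The block shape (32 x 4), DIVIDE_M = 4 and DIVIDE_N = 32 are assumptions here and have to be adjusted to the real kernel; identical addresses within a warp are broadcast, so only distinct words that map to the same bank count as a conflict.

    #include <cstdio>
    #include <set>

    int main()
    {
        const int SUBK = 32, DIVIDE_M = 4, DIVIDE_N = 32;   // assumed values
        const int blockDimX = 32;                           // assumed: tidX 0..31, tidY 0..3
        const int j = 0, m = 0, n = 0;                      // one fixed iteration of the loops

        std::set<int> asWords[32], bsWords[32];             // distinct words touched per bank
        for (int t = 0; t < 32; ++t)                        // the 32 threads of the first warp
        {
            int tidX = t % blockDimX;
            int tidY = t / blockDimX;
            int asIdx = (tidY + m * DIVIDE_M) * SUBK + j;   // as[row][j], 32 words per row
            int bsIdx = j * 32 + (tidX + n * DIVIDE_N);     // bs[j][col], 32 words per row
            asWords[asIdx % 32].insert(asIdx);              // bank = 4-byte word index % 32
            bsWords[bsIdx % 32].insert(bsIdx);
        }
        for (int b = 0; b < 32; ++b)
        {
            if (asWords[b].size() > 1)
                printf("as: bank %2d -> %zu-way conflict\n", b, asWords[b].size());
            if (bsWords[b].size() > 1)
                printf("bs: bank %2d -> %zu-way conflict\n", b, bsWords[b].size());
        }
        return 0;
    }

Repeating this over all (j, m, n) combinations and all warps covers every access of the tile.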

Some common tricks are:

  • reordering the dimensions (e.g. so that threadIdx.x and threadIdx.y index the lowest/rightmost dimensions, even if they belong to separate work packages)
  • creating an array with more elements in the lowest dimension (e.g. 33 instead of 32), so that it can be accessed both row-wise and column-wise
  • instead of [i][j], always accessing [i][j^i] or [i][(j+i)%32] (example for a [32][32] array, which can then be accessed row-wise and column-wise without extending it to 33 columns; see the sketch after this list)
  • in really extreme cases, combining two (or more) accesses: the odd threads access one array position and the even threads another, then they swap, and afterwards the data is resorted within each thread
  • remembering that you can store in a different layout if you also read back in a different layout; and just because a certain column lands in a certain bank for one row, it does not have to for a different row
  • there are a lot of rules and tricks coming from modular arithmetic, remainders and common divisors
  • often it is better to accept more complicated index calculations if that avoids bank conflicts, especially for programs that use shared memory heavily enough for it to become a bottleneck; some (parts of the) index calculations can also be reused between loop iterations
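As an illustration of the [i][(j+i)%32] trick from the list, here is a minimal sketch applied to a 32x32 transpose through shared memory (a hypothetical kernel, not taken from the original code; it assumes a 32x32 thread block):

    #define TILE 32

    __global__ void swizzled_transpose(const float* in, float* out)
    {
        __shared__ float tile[TILE][TILE];          // no +1 padding needed

        int x = threadIdx.x, y = threadIdx.y;

        // store row-wise, shifting the column by the row index:
        // bank = (y * 32 + (x + y) % 32) % 32 = (x + y) % 32, distinct for x = 0..31
        tile[y][(x + y) % TILE] = in[y * TILE + x];
        __syncthreads();

        // read "column-wise" with the same shift: this picks up the value stored
        // by the thread with threadIdx = (y, x), i.e. in[x * TILE + y], so the
        // result is a transpose; the bank is again (x + y) % 32, distinct per warp
        out[y * TILE + x] = tile[x][(y + x) % TILE];
    }

Because the store and the load use the same row-dependent shift, the data stays consistent, and both the row-wise write and the column-wise read of a warp spread over all 32 banks without any padding.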

This should get you quite far.
If you point me to a specific access (a single source code line) that cannot be solved this way, I will help optimize it further.

Thank you for your reply. I will continue to delve deeper into this issue.