A question about loading shared memory in matrix multiplication

NUM_PER_THREAD_M = 8
NUM_PER_THREAD_N = 1
SUBK = 32
as[32][32]
bs[32][32]

            for (int j = 0; j < SUBK; ++j)
            {
                for (int m = 0; m < NUM_PER_THREAD_M; ++m)
                {
                    for (int n = 0; n < NUM_PER_THREAD_N; ++n)
                    {
                        // if (i + j < K)
                        {
                            // shifted column index (idXTmp) mentioned in the question; unused in this baseline version
                            int idXTmp = (j + tidY + m * DIVIDE_M) % 32;
                            // if ((idY + m * DIVIDE_M) < M && (idX + n * DIVIDE_N) < N && i + j < K)
                            cTmp[m][n] += as[(tidY + m * DIVIDE_M)][j] * bs[j][(tidX + n * DIVIDE_N)];
                        }
                    }
                }
            }

I am trying to implement matrix multiplication in CUDA. I ran into a strange issue when I attempted to replace the column index j in the as access with idXTmp. In Nsight Compute, both the Instructions and the Requests for the shared memory loads doubled, and the kernel’s performance also decreased significantly, which I find quite puzzling. I’m wondering whether shared memory is also loaded in sections, so that following the order 0 1 2 3 4 5 6 7 8 would only require two loads, while following the order 1 2 3 4 5 6 7 0 would require three. Could this be the reason for the increase in Requests?

PS: When I changed NUM_PER_THREAD_M from 4 to 8, I noticed bank conflicts appearing. I know this approach won’t solve the bank conflict issue, since there should be no bank conflicts when reading as in the first place. Still, I find this accidental discovery quite strange.

Often it helps to make the arrays a bit larger, e.g. as[32][33], so that a change in either coordinate lands in a different bank.
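For illustration, a minimal sketch of that padding (hypothetical kernel name; float tiles of size 32 as in the question are assumed):

    __global__ void padded_tiles_demo()
    {
        // With 33 columns, the bank of element [r][c] is (r * 33 + c) % 32 =
        // (r + c) % 32, so as[0][c], as[1][c], ... fall into 32 different banks
        // and column-wise accesses of a warp no longer collide.
        __shared__ float as[32][33];
        __shared__ float bs[32][33];

        // load / compute / store as before, simply ignoring the extra column
        as[threadIdx.y][threadIdx.x] = 0.0f;   // placeholder accesses
        bs[threadIdx.y][threadIdx.x] = 0.0f;
    }

The padding costs one unused column of shared memory per row, but the index arithmetic stays unchanged.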

Is it true that in your code tidX goes from 0…7 and tidY is always 0? Or how are the 32 threads of a warp arranged? If you take the index into shared memory (converted into 1D, in 4-byte words) modulo 32, all 32 threads of a warp should get a different result for the bank number to avoid bank conflicts.
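As a sketch, that bank calculation could look like the helper below (hypothetical name; 4-byte elements are assumed, and the row length in words has to match the real array):

    // Bank of a 2D shared-memory element: linearize to a 4-byte word index,
    // then take it modulo 32. Across the 32 threads of a warp the result
    // should differ for every thread to avoid bank conflicts.
    __device__ int bank_of(int row, int col, int rowLengthInWords)
    {
        int idx1D = row * rowLengthInWords + col;
        return idx1D % 32;
    }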

Hi, Curefab, tidY is not always 0.

int tidX = threadIdx.x;
int tidY = threadIdx.y;

Here is my code: CUDA-Learning-Journal/cplusplus/05_multiplication/matrix_multiplication.ipynb at main · Qin-sx/CUDA-Learning-Journal · GitHub.
The test function is matrix_multiplication2.
When I set NUM_PER_THREAD_M = 4, there were no bank conflicts.

When I set NUM_PER_THREAD_M = 8, bank conflicts occurred.

When I set NUM_PER_THREAD_M = 8 and padded the arrays to as[BLOCK_SIZE_M][SUBK+1] and bs[SUBK][BLOCK_SIZE_N+1], bank conflicts still occurred and the function time increased a lot.

Could you help me take a look at why this might be happening?
Thank you for your time!

Hi qin_sx,
the shared load bank conflicts could also come from an inexact measurement, as there are only a few of them and bank conflicts are sometimes not counted exactly.
You can switch to the source-code view in Nsight Compute and find at which code lines the bank conflicts happen.

But for finding and solving bank conflicts in general:
For the first 32 threads (blockIdx.x/y/z == 0 && (threadIdx.z * blockDim.y + threadIdx.y) * blockDim.x + threadIdx.x < 32), calculate the linearized array index for 4-byte accesses (for an [N][M][K] array: idx = (n * M + m) * K + k) modulo 32 for each read or write access separately.
You should get all numbers between 0 and 31, in any order. If some numbers appear multiple times, there are bank conflicts.

You could either print directly from device code (printf), store the indices into some output array, simulate the code (or at least the index calculation) on the host, or do it with pen & paper.
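For example, a minimal host-side simulation of that check for the two shared-memory reads in the loop from the question. The block shape (32 x 4), DIVIDE_M = 4 and DIVIDE_N = 32 are assumptions here and have to be adjusted to the real kernel; identical addresses within a warp are broadcast, so only distinct words that map to the same bank count as a conflict.

    #include <cstdio>
    #include <set>

    int main()
    {
        const int SUBK = 32, DIVIDE_M = 4, DIVIDE_N = 32;   // assumed values
        const int blockDimX = 32;                           // assumed: tidX 0..31, tidY 0..3
        const int j = 0, m = 0, n = 0;                      // one fixed iteration of the loops

        std::set<int> asWords[32], bsWords[32];             // distinct words touched per bank
        for (int t = 0; t < 32; ++t)                        // the 32 threads of the first warp
        {
            int tidX = t % blockDimX;
            int tidY = t / blockDimX;
            int asIdx = (tidY + m * DIVIDE_M) * SUBK + j;   // as[row][j], 32 words per row
            int bsIdx = j * 32 + (tidX + n * DIVIDE_N);     // bs[j][col], 32 words per row
            asWords[asIdx % 32].insert(asIdx);              // bank = 4-byte word index % 32
            bsWords[bsIdx % 32].insert(bsIdx);
        }
        for (int b = 0; b < 32; ++b)
        {
            if (asWords[b].size() > 1)
                printf("as: bank %2d -> %zu-way conflict\n", b, asWords[b].size());
            if (bsWords[b].size() > 1)
                printf("bs: bank %2d -> %zu-way conflict\n", b, bsWords[b].size());
        }
        return 0;
    }

Repeating this over all (j, m, n) combinations and all warps covers every access of the tile.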

Some common tricks are:

  • reordering the dimensions (e.g. so that threadIdx.x and threadIdx.y index the lowest/rightmost dimensions, even if they belong to separate work packages)
  • creating an array with more elements in the lowest dimension (e.g. 33 instead of 32), so that it can be accessed both row-wise and column-wise
  • instead of [i][j], always accessing [i][j^i] or [i][(j+i)%32] (example for a [32][32] array, which can then be accessed row-wise and column-wise without extending it to 33 columns; see the sketch after this list)
  • in really extreme cases, combining two (or more) accesses: the odd threads access one array position and the even threads another, then they swap, and afterwards the data is resorted within each thread
  • remembering that you can store in a different layout if you also read back in a different layout; and just because a certain column lands in a certain bank for one row, it does not have to for a different row
  • there are a lot of rules and tricks coming from modular arithmetic, remainders and common divisors
  • often it is better to accept more complicated index calculations if that avoids bank conflicts, especially for programs that use shared memory heavily enough for it to become a bottleneck; some (parts of the) index calculations can also be reused between loop iterations
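As an illustration of the [i][(j+i)%32] trick from the list, here is a minimal sketch applied to a 32x32 transpose through shared memory (a hypothetical kernel, not taken from the original code; it assumes a 32x32 thread block):

    #define TILE 32

    __global__ void swizzled_transpose(const float* in, float* out)
    {
        __shared__ float tile[TILE][TILE];          // no +1 padding needed

        int x = threadIdx.x, y = threadIdx.y;

        // store row-wise, shifting the column by the row index:
        // bank = (y * 32 + (x + y) % 32) % 32 = (x + y) % 32, distinct for x = 0..31
        tile[y][(x + y) % TILE] = in[y * TILE + x];
        __syncthreads();

        // read "column-wise" with the same shift: this picks up the value stored
        // by the thread with threadIdx = (y, x), i.e. in[x * TILE + y], so the
        // result is a transpose; the bank is again (x + y) % 32, distinct per warp
        out[y * TILE + x] = tile[x][(y + x) % TILE];
    }

Because the store and the load use the same row-dependent shift, the data stays consistent, and both the row-wise write and the column-wise read of a warp spread over all 32 banks without any padding.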

This should get you quite far.
If you point me to a specific access (a single source code line) that cannot be solved this way, I will help optimize it further.

Thank you for your reply. I will continue to delve deeper into this issue.