for (int j = 0; j < SUBK; ++j)
{
    for (int m = 0; m < NUM_PER_THREAD_M; ++m)
    {
        for (int n = 0; n < NUM_PER_THREAD_N; ++n)
        {
            // if (i + j < K)
            {
                int idXTmp = (j + tidY + m * DIVIDE_M) % 32;
                // if ((idY + m * DIVIDE_M) < M && (idX + n * DIVIDE_N) < N && i + j < K)
                cTmp[m][n] += as[(tidY + m * DIVIDE_M)][j] * bs[j][(tidX + n * DIVIDE_N)];
            }
        }
    }
}
I am trying to implement matrix multiplication in CUDA. I ran into a strange issue when I replaced the j in as[...][j] with idXTmp: in Nsight Compute, both the Instructions and the Requests counters for shared-memory loads doubled, and the kernel's performance also dropped significantly, which I find puzzling. I am wondering whether shared memory is also loaded in sections, so that following the order 0 1 2 3 4 5 6 7 8 would only require two loads, while following the order 1 2 3 4 5 6 7 0 would require three. Could this be the reason for the increase in Requests?
PS: When I changed NUM_PER_THREAD_M from 4 to 8, I noticed bank conflicts occurring. I know this change won't solve the bank-conflict issue, since there should be no bank conflict when reading as in the first place. Still, I find this accidental discovery quite strange.
Often it helps to make arrays a bit larger, e.g. as[32][33], so that a change in either coordinate lands in a different bank.
Is it true that in your code tidX goes from 0…7 and tidY is always 0? Or how are your 32 threads of a warp arranged? If you take the index into shared memory (converted to 1D) modulo 32, all 32 threads should get a different result for the bank number to avoid bank conflicts.
Hi qin_sx,
the shared load bank conflicts could also come from an inaccurate measurement, as there are only a few of them and bank conflicts are sometimes not counted exactly.
You can switch to the source code view in Nsight Compute to find at which code lines the bank conflicts happen.
But for generally finding out and solving bank conflicts:
For the first 32 threads (blockIdx.x/y/z == 0 && ((threadIdx.z * blockDim.y + threadIdx.y) * blockDim.x + threadIdx.x) < 32), calculate the linearized array index for 4-byte accesses (for a [N][M][K] array: idx = (n * M + m) * K + k) modulo 32 for each read or write access separately.
You should get all numbers between 0 and 31, in any order. If some numbers appear multiple times, there are bank conflicts.
You could either print directly from device code (printf), store the indices into an output array, simulate the code (or at least the index calculation) on the host, or do it with pen & paper.
Some common tricks are:
resorting the dimensions (e.g. so that threadIdx.x and threadIdx.y index the lowest/rightmost dimensions, even if they belong to separate work packages)
creating an array with more elements in the lowest dimension (e.g. 33 instead of 32), so that it can be accessed row-wise and column-wise
instead of [i][j] (if you decide so), always accessing [i][j^i] or [i][(j+i)%32] (examples for a [32][32] array, which can then be accessed row-wise and column-wise without extending to 33)
in really extreme cases, combining two (or more) accesses: the odd threads do one kind of access, the even threads access another array position, and then they swap; afterwards you resort the data within each thread
remembering that you can store in a different layout, if you also read back in a different layout; and just because a certain column of one row lands in a certain bank, it does not have to for a different row
there are many rules and tricks coming from modular arithmetic, remainders and common divisors
often it is better to have more complicated indices if that avoids bank conflicts, especially for programs that use shared memory so much that it can become a bottleneck; (parts of) the index calculations can also be reused between loop iterations
This should get you quite far.
If you point to a specific access (a single source code line) that cannot be solved in this way, I would help optimize it further.