Hi CUDA Experts,
I’m working on a basic GEMM kernel using 2D tiling.
From the SASS code, it’s clear that when accessing shared memory, the SA array benefits from coalesced (vectorized) access: with an int32 datatype and a 16x16 shared memory tile, four LDS.128 instructions are sufficient for each thread to load a row of SA. However, SB uses non-coalesced access, requiring approximately 4 × (4 × LDS.32) = 16 LDS.32 instructions per thread to load a column of SB.
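To make that concrete, this is how I read the per-thread access pattern of the inner product loop in the kernel below:

    // Per-thread view of the inner product (TILE_SIZE = 16, T = int32):
    for (int n_k = 0; n_k < TILE_SIZE; ++n_k) {
        // SA[ty][n_k]: n_k walks along a row, so each thread touches 16
        // contiguous int32 values -> the compiler emits 4x LDS.128.
        // SB[n_k][tx]: n_k walks down a column, a stride of TILE_SIZE
        // elements between reads -> no vectorization, 16x LDS.32.
        sum += SA[ty][n_k] * SB[n_k][tx];
    }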
I believe it’s possible to achieve both coalesced (vectorized) shared memory access and bank-conflict-free access at the same time. However, my attempt to optimize this made things worse (see the third picture), introducing significant bank conflicts.
What’s the correct approach to achieve this balance?
Thanks!

Here is my current kernel:
template<typename T>
__global__ void __launch_bounds__(1024) gemm_CUDA(T *__restrict__ c, const T *__restrict__ a, const T *__restrict__ b, int M, int N, int K) {
    const int bx = blockIdx.x;
    const int by = blockIdx.y;
    const int TILE_SIZE = 16;
    const int tx = threadIdx.x;
    const int ty = threadIdx.y;
    const int col = bx * TILE_SIZE + tx;
    const int row = by * TILE_SIZE + ty;

    __shared__ T SA[TILE_SIZE][TILE_SIZE];
    __shared__ T SB[TILE_SIZE][TILE_SIZE];

    T sum = 0;
    // March the A and B tiles along the K dimension.
    for (int k = 0; k < (K + TILE_SIZE - 1) / TILE_SIZE; ++k) {
        // Load one tile of A; zero-fill where the tile hangs off the matrix.
        if (row < M && k * TILE_SIZE + tx < K) {
            SA[ty][tx] = a[row * K + k * TILE_SIZE + tx];
        } else {
            SA[ty][tx] = 0;
        }
        // Load one tile of B; consecutive tx read consecutive addresses,
        // so the global load is coalesced.
        if (col < N && k * TILE_SIZE + ty < K) {
            SB[ty][tx] = b[col + (k * TILE_SIZE + ty) * N];
        } else {
            SB[ty][tx] = 0;
        }
        __syncthreads();

        // Inner product over the tile: SA is read along a row (vectorizes
        // to LDS.128), SB along a column (scalar LDS.32).
        for (int n_k = 0; n_k < TILE_SIZE; ++n_k) {
            sum += SA[ty][n_k] * SB[n_k][tx];
        }
        __syncthreads();
    }
    if (row < M && col < N) {
        c[row * N + col] = sum;
    }
}
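For concreteness, the kind of change I have in mind is sketched below: store the B tile transposed so the inner loop reads it row-contiguously and can vectorize like SA does. The names SBt and PAD are introduced here just for the sketch, and SBt would be declared in place of SB in the kernel above. I chose PAD = 4 because padding by 1 would break the 16-byte alignment that LDS.128 needs, but I'm not sure this padding actually avoids conflicts on both the store and the read side — that's essentially my question:

    // Sketch only: a transposed, padded B tile (SBt and PAD are my names).
    // PAD = 4 keeps every row of SBt 16-byte aligned so LDS.128 remains
    // possible; PAD = 1 would break that alignment and force scalar loads.
    const int PAD = 4;
    __shared__ T SBt[TILE_SIZE][TILE_SIZE + PAD];

    // Store transposed: thread (tx, ty) writes element (ty, tx) of the
    // original tile into row tx of SBt. The global read of b is unchanged.
    if (col < N && k * TILE_SIZE + ty < K) {
        SBt[tx][ty] = b[col + (k * TILE_SIZE + ty) * N];
    } else {
        SBt[tx][ty] = 0;
    }
    __syncthreads();

    // Both operands are now read along contiguous rows, so both loads
    // can in principle vectorize to LDS.128:
    for (int n_k = 0; n_k < TILE_SIZE; ++n_k) {
        sum += SA[ty][n_k] * SBt[tx][n_k];
    }

Is this the right direction, or is there a more idiomatic layout that gets vectorized loads for both tiles without the conflicts I'm seeing?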