GEMM Optimization: Achieving Coalesced and Bank Conflict-Free Shared Memory Access

Hi CUDA Experts,

I’m working on a basic GEMM problem using 2D tiling.

From the SASS code, it’s clear that when accessing shared memory, the SA array benefits from memory coalescing. With a datatype of int32 and a 16x16 shared memory tile, four LDS.128 instructions are sufficient to load a row of SA.

However, SB uses non-coalesced access, requiring approximately 4 × (4 × LDS.32) instructions to load a column of SB.

I believe it’s possible to achieve both coalesced shared memory access and bank-conflict-free access simultaneously. However, my attempt to optimize for this made things worse (see the third picture), introducing significant bank conflicts.

What’s the correct approach to achieve this balance?

Thanks!

template<typename T>
__global__ void __launch_bounds__(1024) gemm_CUDA(T *__restrict__ c, const T *__restrict__ a, const T *__restrict__ b, int M, int N, int K) {
    
    const int bx = blockIdx.x;
    const int by = blockIdx.y;
    const int TILE_SIZE = 16;

    const int tx = threadIdx.x;
    const int ty = threadIdx.y;

    const int col = bx * TILE_SIZE + tx;
    const int row = by * TILE_SIZE + ty;

    // Shared memory tiles for the current K-slice of A and B
    __shared__ T SA[TILE_SIZE][TILE_SIZE];
    __shared__ T SB[TILE_SIZE][TILE_SIZE];

    T sum = 0;
    // Walk over the K dimension one tile at a time
    for (int k = 0; k < (K + TILE_SIZE - 1)/TILE_SIZE; ++k) {
        // Load one element of the A tile, zero-padding out-of-range elements
        if (row < M && k * TILE_SIZE + tx < K) {
            SA[ty][tx] = a[row * K + k * TILE_SIZE + tx];
        } else {
            SA[ty][tx] = 0;
        }

        // Load one element of the B tile, zero-padding out-of-range elements
        if (col < N && k * TILE_SIZE + ty < K) {
            SB[ty][tx] = b[col + (k * TILE_SIZE + ty) * N];
        } else {
            SB[ty][tx] = 0;
        }

        __syncthreads();

        // Partial dot product over this tile
        for (int n_k = 0; n_k < TILE_SIZE; ++n_k) {
            sum += SA[ty][n_k] * SB[n_k][tx];
        }
        __syncthreads();
    }

    if (row < M && col < N) {
        c[row * N + col] = sum;
    }
}

You seem to be using the word coalesced with respect to shared memory in a way I’ve not heard of and wouldn’t recommend. Coalescing is a term applied to global memory access and has a pretty clear definition - it is the grouping of accesses from threads in a warp (more than 1!) into a single global memory transaction, by the memory controller.
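As a toy contrast (two made-up kernels, nothing to do with your code specifically, just to make the definition concrete):

// Coalesced: adjacent threads of a warp read adjacent 32-bit words, so the
// memory controller can service the warp with a small number of transactions.
__global__ void coalesced_read(const int *in, int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Not coalesced: adjacent threads read words that are far apart, so the
// warp's accesses cannot be grouped into a few transactions.
__global__ void strided_read(const int *in, int *out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n) out[i] = in[i * stride];
}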

Furthermore, you seem to be highlighting and/or interested in LDS.128 instructions. Those refer to vector loads, which to a first-order approximation are a single-thread concept: they have nothing directly to do with what is happening in adjacent threads, nor with the concept of coalescing used to describe the behavior of the memory controller when multiple threads access global memory.
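For concreteness, a vector load is just one thread moving 128 contiguous, 16-byte-aligned bits in a single instruction. A made-up example (sketch only, launched with a single warp):

// Toy demo: each thread performs one 128-bit shared memory load
// (what the SASS shows as LDS.128). This is purely a per-thread notion.
__global__ void vec_load_demo(const int *in, int *out) {
    // 16 ints per row = 64 bytes; __align__(16) guarantees 16-byte-aligned rows
    __shared__ __align__(16) int smem[32][16];
    for (int j = 0; j < 16; ++j)
        smem[threadIdx.x][j] = in[threadIdx.x * 16 + j];
    // Each thread reads back only its own row, so no synchronization is needed.
    // One thread reads 4 adjacent 32-bit values with a single vector load.
    int4 v = *reinterpret_cast<const int4 *>(&smem[threadIdx.x][0]);
    out[threadIdx.x] = v.x + v.y + v.z + v.w;
}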

Are you wanting to achieve vector loads from shared along with bank-conflict-free access to shared?

With respect to bank conflicts, this construct:

has the potential to lead to bank conflicts for threadblock dimensions of less than 32 threads in x. For that reason, as well as (for me) simplicity of thought process, I wouldn’t try to do this work, at least not as a first cut, with a threadblock x dimension of less than 32.

An LDS.128 instruction suggests that a single thread is able to sensibly use 4 adjacent (32-bit) quantities. This would/could be true if the thread were iterating horizontally in its access pattern. If we unroll the vector dot-product loop, then we see that the accesses to SA move horizontally whereas the accesses to SB move vertically (looking at a single thread’s behavior in the loop). Therefore, it stands to reason that LDS.128 might be easier to achieve with the load from SA (positing loop unrolling) whereas witnessing LDS.128 in the load from SB would not be obvious or sensible - a given thread has no need for adjacent quantities. To head in that direction, the thing that seems more amenable to me would be to make each thread responsible for 4 adjacent columns in the multiply operation(s), but I haven’t worked out how to do that.
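Just to illustrate the SA half of that unrolling argument, the inner loop of your posted kernel could express the horizontal SA traffic as explicit 128-bit loads. This is an untested sketch; it assumes T is a 32-bit type (e.g. the int32 you mentioned) and that SA is declared with __align__(16) so each row start is 16-byte aligned:

        // Replacement for the inner product loop of the posted kernel.
        // The SA traffic becomes one 128-bit load per 4 elements (horizontal, per thread),
        // which is what LDS.128 corresponds to. The SB traffic is unchanged: one
        // 32-bit load per element, moving vertically.
        for (int n_k = 0; n_k < TILE_SIZE; n_k += 4) {
            int4 a4 = *reinterpret_cast<const int4 *>(&SA[ty][n_k]);
            sum += a4.x * SB[n_k + 0][tx];
            sum += a4.y * SB[n_k + 1][tx];
            sum += a4.z * SB[n_k + 2][tx];
            sum += a4.w * SB[n_k + 3][tx];
        }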

Otherwise it’s not obvious to me how to witness LDS.128 there. And bank conflicts are a separate topic, in my view. With a threadblock (and tile) x dimension of 32, this access:

SB[n_k][tx]

is by inspection not bank-conflicted, so I would personally reject wasting time with a profiler to convince myself of that (do as you wish, of course). (It is actually not bank-conflicted for other threadblock x dimensions either, but again I’m focusing on an x dimension of 32 for simplicity and for the consideration mentioned next.) Regarding this access:

SA[ty][n_k]

It might at first glance appear to be possibly bank-conflicted, because it could be columnar access, depending on ty across the warp. But with an x threadblock dimension of 32, consistent with the warp size of 32, we can be assured that ty is constant across the warp; therefore the SA access falls into the broadcast case for shared access - a single location is being accessed warp-wide. That is by definition not bank-conflicted, and I personally would not bother trying to get the profiler to agree with that, but do as you wish, of course.

For me, I like to occasionally remind people that there is a canonical shared memory tiled matrix multiplication example in the programming guide. It’s not intimately related to any of my comments; just pointing it out for reference.


Yes, I had read the shared memory tiled matrix multiplication example in the programming guide before; I think it is almost the same as the code I attached.

I asked this question mainly because I referred to the optimization of Kernel 4 in this repository.
Since I couldn’t successfully run their code, I tried to interpret their so-called “memory coalescing on shared memory” purely based on its description.

However, I think I was misled, and you pointed out the blind spot: there is no concept of “memory coalescing” for shared memory access.

I understand that a single thread accessing SA iterates horizontally over 4 adjacent 32-bit elements, which (with loop unrolling) can let the compiler emit LDS.128 instructions. Therefore, I did try to use transposition or other arrangements to make SB also generate LDS.128 instructions. However, as you mentioned, this doesn’t seem reasonable.

Offering a partial solution:

You can change the order of the coordinates in SB, i.e. index it as SB[tx][ty] (and SB[tx][n_k] in the inner loop), and make the now-rightmost ty dimension a bit too large and odd (e.g. 33), so that accesses by threads of a warp with different tx do not lead to bank conflicts, even though tx is not the rightmost dimension.

Now the different n_k are side by side in memory. But you still cannot use vector loads: due to the odd rightmost dimension, the data is not aligned to 64 or 128 bits.
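A sketch of what I mean, applied to the kernel above (untested; with TILE_SIZE = 16 the padded dimension is 17 rather than 33, but it is the same idea):

template<typename T>
__global__ void __launch_bounds__(1024) gemm_CUDA_padded(T *__restrict__ c, const T *__restrict__ a, const T *__restrict__ b, int M, int N, int K) {
    const int bx = blockIdx.x;
    const int by = blockIdx.y;
    const int TILE_SIZE = 16;

    const int tx = threadIdx.x;
    const int ty = threadIdx.y;

    const int col = bx * TILE_SIZE + tx;
    const int row = by * TILE_SIZE + ty;

    __shared__ T SA[TILE_SIZE][TILE_SIZE];
    // B tile stored transposed; the rightmost dimension is padded to an odd
    // size so that threads with different tx fall into different banks.
    __shared__ T SBt[TILE_SIZE][TILE_SIZE + 1];

    T sum = 0;
    for (int k = 0; k < (K + TILE_SIZE - 1)/TILE_SIZE; ++k) {
        if (row < M && k * TILE_SIZE + tx < K) {
            SA[ty][tx] = a[row * K + k * TILE_SIZE + tx];
        } else {
            SA[ty][tx] = 0;
        }

        // Same global load as before, but stored transposed: SBt[tx][ty]
        if (col < N && k * TILE_SIZE + ty < K) {
            SBt[tx][ty] = b[col + (k * TILE_SIZE + ty) * N];
        } else {
            SBt[tx][ty] = 0;
        }

        __syncthreads();

        for (int n_k = 0; n_k < TILE_SIZE; ++n_k) {
            // For a given thread the successive n_k are now adjacent in memory,
            // but the odd row length means no 64/128-bit alignment, so no vector loads.
            sum += SA[ty][n_k] * SBt[tx][n_k];
        }
        __syncthreads();
    }

    if (row < M && col < N) {
        c[row * N + col] = sum;
    }
}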

In the end, the shared memory accesses are optimized for 32-bit loads per thread, and getting vector loads to work anyway (e.g. by shuffling) will probably lead to a slower program.

Perhaps one could make it work with an even more complex shared memory layout (where not every row is stored with the same layout), with each thread loading data for different iterations concurrently and re-sorting the loaded data within the threads later on.

But is this really worth it if you do not have bank conflicts? Are the non-vector accesses to SB really the limiting factor, and worth a lot of additional local instructions?