Is it possible to use L1 cache instead of shared memory when implementing blocked matmuls in CUDA

Hi, I’ve read many articles about optimizing matmul in CUDA, and all of them implement an algorithm like this: first load a tile of matrix A, and a tile of matrix B to shared memory, and the load from shared memory for the actual computation.

I was wondering, for the same tiled implementation, is it possible to directly load from global memory, and assume that the loaded tiles will reside in L1 cache for the computation of this tile? Basically everything is the same as the shared memory version, but relying on L1 cache rather than shared memory. This would be more like an optimized GEMM implementation on CPUs.

Here’s the code that explains the version that uses shared memory:

void kernel(float* a, float* b, float* c, int M, int N, int K) {
    int tx = threadIdx.x;
    int ty = threadIdx.y;
    int m = blockIdx.y * BM;   // starting index of the block
    int n = blockIdx.x * BN;   
    int _m, _n, _k;  // inner dimensions

    __shared__ float a_block[BM][BK];
    __shared__ float b_block[BK][BN];

    float sum = 0;
    for (int k = 0; k < K; k += BK) {
        _m = ty;  
        _k = tx;  
        a_block[_m][_k] = a[(m+_m)*K + k+_k];

        _k = ty;
        _n = tx;
        b_block[_k][_n] = b[(k+_k)*N + n+_n];


        _m = ty;
        _n = tx;
        for (int _k = 0; _k < BK; _k++) {
            sum += a_block[_m][_k] * b_block[_k][_n];


    _m = ty;
    _n = tx;
    c[(m+_m)*N + n+_n] = sum;

What I am proposing is 1. keep the blocked structure, but directly load from global memory so that the tiles reside in L1 cache for reuse. That way the implementation can be simplified, taking advantge of the HW-managed L1 cache, which is as fast as shared memory. Any problem with that? Thanks!

Is it possible? Sure. Is it going to be faster than a version using shared memory? Unlikely.

One way to think about shared memory is as a programmer-managed cache, with full control in the hands of the programmer. With caches it is never guaranteed that a particular piece of data resides in the cache at any particular time, one can only assess the likelihood of that being the case depending on fill policy, replacement policy, write policy, etc.

If you restrict yourself to GEMM with square matrices and a single transpose mode (N, T is a popular choice for this), you could easily explore the alternatives yourself in a couple of days, and develop the cache blocking approach further if early results appear promising: extend to matrices of any aspect ratio, all transpose modes, multiple GPU architectures. CUBLAS is mature and can be used as a performance yardstick.

If you perform a thorough literature search (Google Scholar is a good starting point), you may even find that someone explored this earlier, and benefit from their findings.

1 Like

I see thank you very much for the explanation! I’ve wondered why CPUs don’t use a L1-cache/shared memory combined approach, and let the programmer explicitly place data in the cache. It seems to be very helpful to have both automatically HW managed cache and programmer controlled cache, like shared memory in GPUs so that when we do need explicit cache control it’s at our disposal. Any reasons CPUs are not designed that way?

My memory is hazy but to my recollection there have been (and probably still are) CPUs with scratch pad memory {instead of | in addition to} caches. As I recall, mostly in the embedded and signal processing space, with the goal of providing full programmer control and completely predictable timing. But caches are something that accelerates the vast majority of memory accesses automagically, so is more comfortable for programmers to use while being sufficient for good performance across a large set of use cases.

GPUs used for CUDA started without a classical L1/L2 cache hierarchy and featured only constant cache and texture cache inherited from graphics. As GPU compute was new and unproven as a market, only minimal extension to existing hardware was warranted, and the cheapest way to provide something cache-like was shared memory. Now GPU compute (across all application areas from HPC to AI) is a giant market, major investments are possible, and GPUs have adopted many classical CPU features, while at the same time CPUs have been equipped with GPU-like features (see the Fugaku supercomputer, for example).

I think it is fair to say that the importance of shared memory in CUDA programming has decreased with the advent of L1/L2 caches of competitive size in GPUs. For use cases requiring peak performance, shared memory can still be important due to the programmer control it provides. CUDA programmers are free to choose, just like they can choose the method of data movement between host and device: classical explicit copies under full programmer control or unified memory management magic, with the former often preferable where peak performance is required.

That’s very well said! Thank you for offering the historical/investment point of view. It makes much more sense why CPUs/GPUs are designed the way they are when we looked at how each started and evolved. And I agree with the final remarks as well, i.e. CUDA programmers are free to choose different methods based on their needs. Thanks again for your insights! Very helpful!