CUDA GEMM with shared memory is slower than with global memory

I'm trying to use shared memory for GEMM, but it runs slower than GEMM with global memory. The shared-memory kernel is taken from CUDA docs section 3.2.4 (Shared Memory). For the same matrix size, 4096×4096, cuBLAS works fine, but the shared-memory kernel is far too slow. I want to know what's wrong with my code.
RTX 4070 Laptop GPU, 12800HX, 32 GB memory, Windows 11
nvcc -std=c++17 -arch=sm_89 -g -lcublas -lcudart -G -O3 -o test test.cu
shared mem kernel time 1631.566650ms
cublas time 41.416702ms
naive time 1208.498047ms
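
For reference, a minimal sketch of how kernel times like these are usually measured with CUDA events (the MatMulKernel launch here stands in for whichever kernel is being timed):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
MatMulKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C); // kernel under test
cudaEventRecord(stop);
cudaEventSynchronize(stop);                         // wait for the kernel to finish

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);             // elapsed time in milliseconds
printf("shared mem kernel time %fms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);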

This is the GEMM kernel with shared memory from CUDA docs section 3.2.4 (Shared Memory):

// Matrix multiplication kernel called by MatMul()
__global__ void MatMulKernel(Matrix A, Matrix B, Matrix C)
{
    // Block row and column
    int blockRow = blockIdx.y;
    int blockCol = blockIdx.x;
    // Each thread block computes one sub-matrix Csub of C
    Matrix Csub = GetSubMatrix(C, blockRow, blockCol);
    // Each thread computes one element of Csub
    // by accumulating results into Cvalue
    float Cvalue = 0;
    // Thread row and column within Csub
    int row = threadIdx.y;
    int col = threadIdx.x;
    // Loop over all the sub-matrices of A and B that are
    // required to compute Csub
    // Multiply each pair of sub-matrices together
    // and accumulate the results
    for (int m = 0; m < (A.width / BLOCK_SIZE); ++m) {
        // Get sub-matrix Asub of A
        Matrix Asub = GetSubMatrix(A, blockRow, m);
        // Get sub-matrix Bsub of B
        Matrix Bsub = GetSubMatrix(B, m, blockCol);
        // Shared memory used to store Asub and Bsub respectively
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
        // Load Asub and Bsub from device memory to shared memory
        // Each thread loads one element of each sub-matrix
        As[row][col] = GetElement(Asub, row, col);
        Bs[row][col] = GetElement(Bsub, row, col);
        // Synchronize to make sure the sub-matrices are loaded
        // before starting the computation
        __syncthreads();
        // Multiply Asub and Bsub together
        for (int e = 0; e < BLOCK_SIZE; ++e)
            Cvalue += As[row][e] * Bs[e][col];
        // Synchronize to make sure that the preceding
        // computation is done before loading two new
        // sub-matrices of A and B in the next iteration
        __syncthreads();
    }
    // Write Csub to device memory
    // Each thread writes one element
    SetElement(Csub, row, col, Cvalue);
}
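
For completeness, the kernel relies on the Matrix type, BLOCK_SIZE, and the helper functions defined in the same docs section (the docs use BLOCK_SIZE = 16):

// Matrices are stored in row-major order:
// M(row, col) = *(M.elements + row * M.stride + col)
typedef struct {
    int width;
    int height;
    int stride;
    float* elements;
} Matrix;

#define BLOCK_SIZE 16

// Get a matrix element
__device__ float GetElement(const Matrix A, int row, int col)
{
    return A.elements[row * A.stride + col];
}

// Set a matrix element
__device__ void SetElement(Matrix A, int row, int col, float value)
{
    A.elements[row * A.stride + col] = value;
}

// Locate the BLOCK_SIZE x BLOCK_SIZE sub-matrix Asub of A that is
// col sub-matrices to the right and row sub-matrices down from the
// upper-left corner of A
__device__ Matrix GetSubMatrix(Matrix A, int row, int col)
{
    Matrix Asub;
    Asub.width    = BLOCK_SIZE;
    Asub.height   = BLOCK_SIZE;
    Asub.stride   = A.stride;
    Asub.elements = &A.elements[A.stride * BLOCK_SIZE * row + BLOCK_SIZE * col];
    return Asub;
}

The kernel is launched with one thread per output element:

dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
dim3 dimGrid(B.width / dimBlock.x, A.height / dimBlock.y);
MatMulKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);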

Don't measure performance of code compiled with -G. This flag disables all optimizations in device code, so the -O3 on your command line does nothing for the kernels themselves.
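
For a meaningful timing run, drop -G (and -g unless you need host-side debug info), e.g.:

nvcc -std=c++17 -arch=sm_89 -O3 -lcublas -o test test.cu

With -G removed, the shared-memory kernel should come out faster than the naive global-memory kernel, though still well short of cuBLAS.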

Thanks a lot!
