CUDA GEMM with shared memory is slower than with global memory

I'm trying to use shared memory for GEMM, but it runs slower than GEMM with global memory. The shared-memory kernel is taken from CUDA docs section 3.2.4 (Shared Memory). For the same matrix size, 4096×4096, cuBLAS works fine, but the shared-memory kernel is far too slow. I want to know what's wrong with my code.
RTX 4070 Laptop GPU, 12800HX, 32 GB memory, Windows 11
nvcc -std=c++17 -arch=sm_89 -g -lcublas -lcudart -G -O3 -o test test.cu
shared mem kernel time 1631.566650ms
cublas time 41.416702ms
naive time 1208.498047ms
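
For reference, a minimal sketch of how kernel times like these are usually measured with CUDA events (the MatMulKernel launch here stands in for whichever kernel is being timed):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
MatMulKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C); // kernel under test
cudaEventRecord(stop);
cudaEventSynchronize(stop);                         // wait for the kernel to finish

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);             // elapsed time in milliseconds
printf("shared mem kernel time %fms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);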

This is the GEMM kernel with shared memory from CUDA docs section 3.2.4 (Shared Memory):

// Matrix multiplication kernel called by MatMul()
__global__ void MatMulKernel(Matrix A, Matrix B, Matrix C)
{
    // Block row and column
    int blockRow = blockIdx.y;
    int blockCol = blockIdx.x;
    // Each thread block computes one sub-matrix Csub of C
    Matrix Csub = GetSubMatrix(C, blockRow, blockCol);
    // Each thread computes one element of Csub
    // by accumulating results into Cvalue
    float Cvalue = 0;
    // Thread row and column within Csub
    int row = threadIdx.y;
    int col = threadIdx.x;
    // Loop over all the sub-matrices of A and B that are
    // required to compute Csub
    // Multiply each pair of sub-matrices together
    // and accumulate the results
    for (int m = 0; m < (A.width / BLOCK_SIZE); ++m) {
        // Get sub-matrix Asub of A
        Matrix Asub = GetSubMatrix(A, blockRow, m);
        // Get sub-matrix Bsub of B
        Matrix Bsub = GetSubMatrix(B, m, blockCol);
        // Shared memory used to store Asub and Bsub respectively
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
        // Load Asub and Bsub from device memory to shared memory
        // Each thread loads one element of each sub-matrix
        As[row][col] = GetElement(Asub, row, col);
        Bs[row][col] = GetElement(Bsub, row, col);
        // Synchronize to make sure the sub-matrices are loaded
        // before starting the computation
        __syncthreads();
        // Multiply Asub and Bsub together
        for (int e = 0; e < BLOCK_SIZE; ++e)
            Cvalue += As[row][e] * Bs[e][col];
        // Synchronize to make sure that the preceding
        // computation is done before loading two new
        // sub-matrices of A and B in the next iteration
        __syncthreads();
    }
    // Write Csub to device memory
    // Each thread writes one element
    SetElement(Csub, row, col, Cvalue);
}
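
For completeness, the kernel relies on the Matrix type, BLOCK_SIZE, and the helper functions defined in the same docs section (the docs use BLOCK_SIZE = 16):

// Matrices are stored in row-major order:
// M(row, col) = *(M.elements + row * M.stride + col)
typedef struct {
    int width;
    int height;
    int stride;
    float* elements;
} Matrix;

#define BLOCK_SIZE 16

// Get a matrix element
__device__ float GetElement(const Matrix A, int row, int col)
{
    return A.elements[row * A.stride + col];
}

// Set a matrix element
__device__ void SetElement(Matrix A, int row, int col, float value)
{
    A.elements[row * A.stride + col] = value;
}

// Locate the BLOCK_SIZE x BLOCK_SIZE sub-matrix Asub of A that is
// col sub-matrices to the right and row sub-matrices down from the
// upper-left corner of A
__device__ Matrix GetSubMatrix(Matrix A, int row, int col)
{
    Matrix Asub;
    Asub.width    = BLOCK_SIZE;
    Asub.height   = BLOCK_SIZE;
    Asub.stride   = A.stride;
    Asub.elements = &A.elements[A.stride * BLOCK_SIZE * row + BLOCK_SIZE * col];
    return Asub;
}

The kernel is launched with one thread per output element:

dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
dim3 dimGrid(B.width / dimBlock.x, A.height / dimBlock.y);
MatMulKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);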

Don't measure performance of code compiled with -G. This flag disables all optimizations in device code, so the -O3 on your command line does nothing for the kernels themselves.
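
For a meaningful timing run, drop -G (and -g unless you need host-side debug info), e.g.:

nvcc -std=c++17 -arch=sm_89 -O3 -lcublas -o test test.cu

With -G removed, the shared-memory kernel should come out faster than the naive global-memory kernel, though still well short of cuBLAS.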

Thanks a lot!
