Why is cuBLAS cublasDgemm slower than my naive GEMM kernel?

Hi everyone,

I am testing the performance of matrix multiplication and have run into a confusing result.

I wrote a very simple (naive) GEMM kernel as follows:

// DATA_TYPE is double in my build, to match cublasDgemm
#define DATA_TYPE double

__global__ void gemm_naive(int M, int N, int K,
                           DATA_TYPE alpha, DATA_TYPE beta,
                           DATA_TYPE *C, const DATA_TYPE *A, const DATA_TYPE *B) {
    // One thread computes one element of C (row-major layout)
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    if (row < M && col < N) {
        DATA_TYPE sum = 0.0;
        for (int i = 0; i < K; i++) {
            sum += A[row * K + i] * B[i * N + col];
        }
        C[row * N + col] = alpha * sum + beta * C[row * N + col];
    }
}
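
The launch configuration is a 32×32 block with one thread per element of C, roughly like this (M_SIZE/N_SIZE/K_SIZE are the same size macros used in the cuBLAS call below):

// One thread per element of C; grid rounds up to cover M x N
dim3 block(32, 32);
dim3 grid((N_SIZE + block.x - 1) / block.x,
          (M_SIZE + block.y - 1) / block.y);
gemm_naive<<<grid, block>>>(M_SIZE, N_SIZE, K_SIZE, alpha, beta, d_C, d_A, d_B);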

I also tested cuBLAS by calling cublasDgemm with exactly the same matrix sizes and data:

CUDA_CHECK(cudaEventRecord(start));
// cuBLAS uses column-major storage, so to compute the row-major
// C = alpha*A*B + beta*C, A and B are swapped and the sizes passed as (N, M, K)
CUBLAS_CHECK(cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                         N_SIZE, M_SIZE, K_SIZE, &alpha,
                         d_B, N_SIZE, d_A, K_SIZE, &beta,
                         d_C_cublas, N_SIZE));
CUDA_CHECK(cudaEventRecord(stop));
CUDA_CHECK(cudaEventSynchronize(stop));
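
For what it's worth, a host-side check along these lines can confirm the two paths compute the same C (h_C_naive / h_C_cublas are placeholder names for the copied-back results):

// Max absolute difference between the two results (assumes <cmath>/<cstdio>)
double max_diff = 0.0;
for (size_t i = 0; i < (size_t)M_SIZE * N_SIZE; i++) {
    double d = fabs(h_C_naive[i] - h_C_cublas[i]);
    if (d > max_diff) max_diff = d;
}
printf("max abs diff = %e\n", max_diff);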

When profiling with nsys, I was surprised to find that the cuBLAS kernel was slower than my naive kernel, which is the opposite of what I expected.

So my questions are:

  • Is this expected behavior?

  • Could my way of calling cuBLAS be incorrect?

  • Or are these two implementations not actually performing equivalent computations?

Here are some kernel profiling results from nsys (key fields only):

// cuBLAS:
cutlass::Kernel<cutlass_80_tensorop_d884gemm_64x64_16x4_nn_align1>
Duration: ~348 ms
grid:  <<<632, 12, 1>>>
block: <<<128, 1, 1>>>
Registers Per Thread: 126
Shared Memory executed: 102,400 bytes
Theoretical occupancy: 8.3 %
// Naive kernel:
gemm_naive(int, int, int, double, double, double *, const double *, const double *)
Duration: ~305 ms
grid:  <<<188, 157, 1>>>
block: <<<32, 32, 1>>>
Registers Per Thread: 40
Shared Memory executed: 8,192 bytes
Theoretical occupancy: 66.7 %

I’d suggest providing a complete test case:

  • a minimal but complete code sample, i.e. something that someone else could compile and run without having to add or change anything (a sketch of the rough shape follows this list)
  • a description of the platform: CUDA version, OS, GPU
  • the exact compile command used
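
As a rough shape for such a test case (a bare sketch, not your actual program; the sizes, scalars, initialization, and error checking are placeholders to fill in):

#include <cstdio>
#include <cuda_runtime.h>
#include <cublas_v2.h>

#define M_SIZE 1024   // placeholder sizes; use your real dimensions
#define N_SIZE 1024
#define K_SIZE 1024

// ... gemm_naive kernel and launch from above go here ...

int main() {
    size_t bytesA = (size_t)M_SIZE * K_SIZE * sizeof(double);
    size_t bytesB = (size_t)K_SIZE * N_SIZE * sizeof(double);
    size_t bytesC = (size_t)M_SIZE * N_SIZE * sizeof(double);

    double *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, bytesA);
    cudaMalloc(&d_B, bytesB);
    cudaMalloc(&d_C, bytesC);
    // ... initialize d_A/d_B (e.g. cudaMemcpy from filled host buffers) ...

    double alpha = 1.0, beta = 0.0;   // placeholder scalars

    cublasHandle_t handle;
    cublasCreate(&handle);

    // Row-major C = A*B via the column-major argument swap, as above
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                N_SIZE, M_SIZE, K_SIZE, &alpha,
                d_B, N_SIZE, d_A, K_SIZE, &beta,
                d_C, N_SIZE);
    cudaDeviceSynchronize();

    cublasDestroy(handle);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    return 0;
}

Plus the compile command, e.g. something like nvcc -O3 -arch=sm_XX gemm_test.cu -lcublas -o gemm_test, with your actual architecture in place of sm_XX.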