Why is cuBLAS cublasDgemm slower than my naive GEMM kernel?

Hi everyone,

I am testing the performance of matrix multiplication and have run into a confusing result.

I wrote a very simple (naive) GEMM kernel as follows:

// DATA_TYPE is double in my build, to match cublasDgemm
#define DATA_TYPE double

__global__ void gemm_naive(int M, int N, int K,
                           DATA_TYPE alpha, DATA_TYPE beta,
                           DATA_TYPE *C, const DATA_TYPE *A, const DATA_TYPE *B) {
    // One thread computes one element of C (row-major layout)
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    if (row < M && col < N) {
        DATA_TYPE sum = 0.0;
        for (int i = 0; i < K; i++) {
            sum += A[row * K + i] * B[i * N + col];
        }
        C[row * N + col] = alpha * sum + beta * C[row * N + col];
    }
}
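
The launch configuration is a 32×32 block with one thread per element of C, roughly like this (M_SIZE/N_SIZE/K_SIZE are the same size macros used in the cuBLAS call below):

// One thread per element of C; grid rounds up to cover M x N
dim3 block(32, 32);
dim3 grid((N_SIZE + block.x - 1) / block.x,
          (M_SIZE + block.y - 1) / block.y);
gemm_naive<<<grid, block>>>(M_SIZE, N_SIZE, K_SIZE, alpha, beta, d_C, d_A, d_B);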

I also tested cuBLAS by calling cublasDgemm with exactly the same matrix sizes and data:

CUDA_CHECK(cudaEventRecord(start));
// cuBLAS uses column-major storage, so to compute the row-major
// C = alpha*A*B + beta*C, A and B are swapped and the sizes passed as (N, M, K)
CUBLAS_CHECK(cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                         N_SIZE, M_SIZE, K_SIZE, &alpha,
                         d_B, N_SIZE, d_A, K_SIZE, &beta,
                         d_C_cublas, N_SIZE));
CUDA_CHECK(cudaEventRecord(stop));
CUDA_CHECK(cudaEventSynchronize(stop));
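
For what it's worth, a host-side check along these lines can confirm the two paths compute the same C (h_C_naive / h_C_cublas are placeholder names for the copied-back results):

// Max absolute difference between the two results (assumes <cmath>/<cstdio>)
double max_diff = 0.0;
for (size_t i = 0; i < (size_t)M_SIZE * N_SIZE; i++) {
    double d = fabs(h_C_naive[i] - h_C_cublas[i]);
    if (d > max_diff) max_diff = d;
}
printf("max abs diff = %e\n", max_diff);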

When profiling with nsys, I was surprised to find that the cuBLAS kernel was slower than my naive kernel, which is the opposite of what I expected.

So my questions are:

  • Is this expected behavior?

  • Could my way of calling cuBLAS be incorrect?

  • Or are these two implementations not actually performing equivalent computations?

Here are some kernel profiling results from nsys (key fields only):

// cuBLAS:
cutlass::Kernel<cutlass_80_tensorop_d884gemm_64x64_16x4_nn_align1>
Duration: ~348 ms
grid:  <<<632, 12, 1>>>
block: <<<128, 1, 1>>>
Registers Per Thread: 126
Shared Memory executed: 102,400 bytes
Theoretical occupancy: 8.3 %
// Naive kernel:
gemm_naive(int, int, int, double, double, double *, const double *, const double *)
Duration: ~305 ms
grid:  <<<188, 157, 1>>>
block: <<<32, 32, 1>>>
Registers Per Thread: 40
Shared Memory executed: 8,192 bytes
Theoretical occupancy: 66.7 %

I’d suggest providing a complete test case:

  • a minimal but complete code sample, i.e. something that someone else could compile and run without having to add or change anything (a sketch of the rough shape follows this list)
  • a description of the platform: CUDA version, OS, GPU
  • the exact compile command used
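
As a rough shape for such a test case (a bare sketch, not your actual program; the sizes, scalars, initialization, and error checking are placeholders to fill in):

#include <cstdio>
#include <cuda_runtime.h>
#include <cublas_v2.h>

#define M_SIZE 1024   // placeholder sizes; use your real dimensions
#define N_SIZE 1024
#define K_SIZE 1024

// ... gemm_naive kernel and launch from above go here ...

int main() {
    size_t bytesA = (size_t)M_SIZE * K_SIZE * sizeof(double);
    size_t bytesB = (size_t)K_SIZE * N_SIZE * sizeof(double);
    size_t bytesC = (size_t)M_SIZE * N_SIZE * sizeof(double);

    double *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, bytesA);
    cudaMalloc(&d_B, bytesB);
    cudaMalloc(&d_C, bytesC);
    // ... initialize d_A/d_B (e.g. cudaMemcpy from filled host buffers) ...

    double alpha = 1.0, beta = 0.0;   // placeholder scalars

    cublasHandle_t handle;
    cublasCreate(&handle);

    // Row-major C = A*B via the column-major argument swap, as above
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                N_SIZE, M_SIZE, K_SIZE, &alpha,
                d_B, N_SIZE, d_A, K_SIZE, &beta,
                d_C, N_SIZE);
    cudaDeviceSynchronize();

    cublasDestroy(handle);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    return 0;
}

Plus the compile command, e.g. something like nvcc -O3 -arch=sm_XX gemm_test.cu -lcublas -o gemm_test, with your actual architecture in place of sm_XX.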