Hi everyone,
I am testing the performance of matrix multiplication and encountered a confusing result.
I wrote a very simple (naive) GEMM kernel as follows:
__global__ void gemm_naive(int M, int N, int K,
                           DATA_TYPE alpha, DATA_TYPE beta,
                           DATA_TYPE *C, const DATA_TYPE *A, const DATA_TYPE *B) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;  // row index into C
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // column index into C
    if (row < M && col < N) {
        DATA_TYPE sum = 0.0;
        // A is M x K, B is K x N, both stored row-major
        for (int i = 0; i < K; i++) {
            sum += A[row * K + i] * B[i * N + col];
        }
        C[row * N + col] = alpha * sum + beta * C[row * N + col];
    }
}
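For context, this is roughly how the kernel is launched (reconstructed sketch; the 32x32 block and the ceil-divided grid match the nsys numbers further down, DATA_TYPE is double, and d_C is the output buffer for the naive kernel, kept separate from d_C_cublas):

// Launch sketch (the real harness also allocates and copies the buffers).
dim3 block(32, 32);
dim3 grid((N_SIZE + block.x - 1) / block.x,
          (M_SIZE + block.y - 1) / block.y);
gemm_naive<<<grid, block>>>(M_SIZE, N_SIZE, K_SIZE, alpha, beta, d_C, d_A, d_B);
CUDA_CHECK(cudaGetLastError());
CUDA_CHECK(cudaDeviceSynchronize());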
I also tested cuBLAS by calling cublasDgemm with exactly the same matrix sizes and data:
CUDA_CHECK(cudaEventRecord(start));
CUBLAS_CHECK(cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                         N_SIZE, M_SIZE, K_SIZE, &alpha,
                         d_B, N_SIZE, d_A, K_SIZE, &beta,
                         d_C_cublas, N_SIZE));
CUDA_CHECK(cudaEventRecord(stop));
CUDA_CHECK(cudaEventSynchronize(stop));
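The elapsed time is then read from the two events with the usual pattern (sketch; the printf is just illustrative):

// Read the time between the two recorded events after synchronizing on 'stop'.
float ms = 0.0f;
CUDA_CHECK(cudaEventElapsedTime(&ms, start, stop));
printf("cublasDgemm: %.3f ms\n", ms);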
When profiling with nsys, I was surprised to find that the cuBLAS kernel was slower than my naive kernel, which is the opposite of what I expected.
So my questions are:
- Is this expected behavior?
- Could my way of calling cuBLAS be incorrect?
- Or are these two implementations not actually performing equivalent computations? (A sketch of how I compare the two outputs is right after this list.)
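For the last point, this is roughly how I compare the results (sketch; h_C_naive and h_C_cublas are host copies of the naive and cuBLAS output buffers, and those names are just placeholders):

// Rough element-wise comparison of the two results on the host.
double max_err = 0.0;
for (size_t i = 0; i < (size_t)M_SIZE * N_SIZE; ++i) {
    double diff = fabs(h_C_naive[i] - h_C_cublas[i]);
    if (diff > max_err) max_err = diff;
}
printf("max abs diff = %e\n", max_err);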
Here are some kernel profiling results from nsys (key fields only):
// cuBLAS kernel:
cutlass::Kernel<cutlass_80_tensorop_d884gemm_64x64_16x4_nn_align1>
Duration: ~348 ms
grid: <<<632, 12, 1>>>
block: <<<128, 1, 1>>>
Registers Per Thread: 126
Shared Memory executed: 102,400 bytes
Theoretical occupancy: 8.3 %
// Naive kernel:
gemm_naive(int, int, int, double, double, double *, const double *, const double *)
Duration: ~305 ms
grid: <<<188, 157, 1>>>
block: <<<32, 32, 1>>>
Registers Per Thread: 40
Shared Memory executed: 8,192 bytes
Theoretical occupancy: 66.7 %