Poor performance of cuBLAS for extremely small matrix multiplications?

Hi,

I’m trying to accelerate code that does many small (3×3) matrix multiplications. I compared the performance of serial CPU code, OpenMP CPU code, cuBLAS (strided batched GEMM), and OpenACC. cuBLAS shows the worst performance of all: it is tens of times slower than the CPU OpenMP version, and even slower than the CPU serial version.

For cuBLAS and OpenACC, I copy the data to the device only once and then run multiple iterations of matrix multiplications. I used the PGI compiler for OpenACC and the Intel compiler for the others. My hardware is an Intel Xeon W-2145 CPU and a Quadro P2200 GPU.
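
(For reference, the OpenACC version follows the same pattern: one data region moves the arrays to the device, and the multiplication loop runs inside it. This is only a rough sketch with illustrative names, not the exact code I used.)

// Rough OpenACC sketch (illustrative names): flat arrays of num_mat column-major
// 3x3 matrices; data is copied once, the multiplication loop runs niter times.
#pragma acc data copyin(M[0:9*num_mat], N[0:9*num_mat]) copyout(P[0:9*num_mat])
{
    for (int it = 0; it < niter; it++) {
        #pragma acc parallel loop
        for (int b = 0; b < num_mat; b++)
            for (int col = 0; col < 3; col++)
                for (int row = 0; row < 3; row++) {
                    float sum = 0.0f;
                    for (int k = 0; k < 3; k++)
                        sum += M[9*b + row + 3*k] * N[9*b + k + 3*col];
                    P[9*b + row + 3*col] = sum;
                }
    }
}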

Is it normal to see such low performance from cuBLAS for this type of calculation?
Thanks,

Can you provide reproducer code/snippet?

Hi, here’s the function for cuBLAS,

#include <cassert>
#include <cuda_runtime.h>
#include <cublas_v2.h>

typedef float mytype;

// P = alpha*M*N + beta*P for num_mat independent small matrices,
// repeated niter times on the device.
void GPU_MultiStridedBatch(mytype *M, mytype *N, mytype *P, size_t pr, size_t pc, size_t mc,
                           mytype alpha, mytype beta, int num_mat, int niter)
{
    mytype *devM, *devN, *devP;

    size_t p_size = sizeof(mytype) * pr * pc;
    size_t m_size = sizeof(mytype) * pr * mc;
    size_t n_size = sizeof(mytype) * mc * pc;

    cudaMalloc((void**)&devM, m_size * num_mat);
    cudaMalloc((void**)&devN, n_size * num_mat);
    cudaMalloc((void**)&devP, p_size * num_mat);

    // Copy the input batches to the device once, before the iteration loop.
    cudaMemcpy(devM, M, m_size * num_mat, cudaMemcpyHostToDevice);
    cudaMemcpy(devN, N, n_size * num_mat, cudaMemcpyHostToDevice);

    cublasHandle_t myhandle;
    cublasStatus_t cublas_result;

    cublas_result = cublasCreate(&myhandle);
    assert(cublas_result == CUBLAS_STATUS_SUCCESS);

    for (int i = 0; i < niter; i++) {
        // One call multiplies all num_mat matrices in the batch.
        cublas_result = cublasSgemmStridedBatched(myhandle, CUBLAS_OP_N, CUBLAS_OP_N,
                                                  pr, pc, mc,
                                                  &alpha, devM, pr, pr * mc,
                                                  devN, mc, mc * pc,
                                                  &beta, devP, pr, pr * pc,
                                                  num_mat);
        assert(cublas_result == CUBLAS_STATUS_SUCCESS);  // check every call, not just the last one
    }

    cudaMemcpy(P, devP, p_size * num_mat, cudaMemcpyDeviceToHost);

    cudaFree(devM);
    cudaFree(devN);
    cudaFree(devP);
    cublasDestroy(myhandle);
}
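
For what it’s worth, the loop above can be timed with CUDA events (or a cudaDeviceSynchronize before stopping a host timer), since the batched calls return asynchronously. A minimal sketch, with illustrative event names:

cudaEvent_t t0, t1;                      // illustrative names, not in the code above
cudaEventCreate(&t0);
cudaEventCreate(&t1);

cudaEventRecord(t0, 0);
/* ... the cublasSgemmStridedBatched loop above ... */
cudaEventRecord(t1, 0);
cudaEventSynchronize(t1);                // all niter batched calls have finished here

float ms = 0.0f;
cudaEventElapsedTime(&ms, t0, t1);       // elapsed time in milliseconds
printf("cuBLAS batched GEMM time : %e seconds \n", ms * 1.0e-3f);

cudaEventDestroy(t0);
cudaEventDestroy(t1);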

And for the CPU version,

#include <cstdio>
#include <omp.h>

// 3x3 matrices, per the problem description above.
#define ROWM 3
#define COLM 3
#define COLN 3

struct element{
    mytype m[ROWM][COLM], n[COLM][COLN], p[ROWM][COLN];
};

int CPU_multi(int num_mat, int niter)
{
    element* elms = new element[num_mat];

    // Fill the batch with constant test data.
    for (int i = 0; i < num_mat; i++) {
        for (int j = 0; j < ROWM; j++)
            for (int k = 0; k < COLM; k++)
                elms[i].m[j][k] = 3.0f;
        for (int j = 0; j < COLM; j++)
            for (int k = 0; k < COLN; k++)
                elms[i].n[j][k] = 2.0f;
        for (int j = 0; j < ROWM; j++)
            for (int k = 0; k < COLN; k++)
                elms[i].p[j][k] = 0.0f;
    }

    double t1 = omp_get_wtime();
    for (int it = 0; it < niter; it++)
        for (int k = 0; k < num_mat; k++)
            for (int i = 0; i < ROWM; i++)
                for (int j = 0; j < COLN; j++) {
                    elms[k].p[i][j] = 0.0f;
                    for (int m = 0; m < COLM; m++)
                        elms[k].p[i][j] += elms[k].m[i][m] * elms[k].n[m][j];
                }
    double t2 = omp_get_wtime();

    printf("CPU serial time : %e seconds \n", t2 - t1);
    delete[] elms;   // array form of delete for new[]
    return 0;
}
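
The CPU OpenMP version mentioned above is not shown; it is essentially the same triple loop with the batch loop parallelized. A rough sketch of what it looks like (not the exact code used):

// Rough OpenMP sketch (illustrative): distribute the batch loop across threads.
void CPU_multi_omp(element* elms, int num_mat, int niter)
{
    for (int it = 0; it < niter; it++) {
        #pragma omp parallel for
        for (int k = 0; k < num_mat; k++)
            for (int i = 0; i < ROWM; i++)
                for (int j = 0; j < COLN; j++) {
                    mytype sum = 0.0f;
                    for (int m = 0; m < COLM; m++)
                        sum += elms[k].m[i][m] * elms[k].n[m][j];
                    elms[k].p[i][j] = sum;
                }
    }
}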

Thanks!

@mnicely is there any progress on speeding up cuBLAS for small matrices? It has a really big impact on downstream frameworks that use it (e.g. PyTorch).

Have you tried using cuBLASDx? At some point, small matrices become latency/memory bound on larger GPUs.
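
To illustrate the latency/memory-bound point: for batches of 3×3 matrices, a hand-written kernel that assigns one thread per matrix is often competitive, because the arithmetic per matrix is trivial and the run is dominated by memory traffic and launch overhead. A rough sketch of that idea (not cuBLASDx itself; the column-major layout and stride of 9 just mirror the cuBLAS call above):

// One thread computes one 3x3 product P = alpha*M*N + beta*P.
// Column-major layout, stride 9 between matrices (illustrative assumptions).
__global__ void batched_gemm3x3(const float* M, const float* N, float* P,
                                float alpha, float beta, int num_mat)
{
    int b = blockIdx.x * blockDim.x + threadIdx.x;
    if (b >= num_mat) return;

    const float* m = M + 9 * b;
    const float* n = N + 9 * b;
    float*       p = P + 9 * b;

    for (int col = 0; col < 3; col++)
        for (int row = 0; row < 3; row++) {
            float acc = 0.0f;
            for (int k = 0; k < 3; k++)
                acc += m[row + 3 * k] * n[k + 3 * col];
            // Note: unlike cuBLAS, this reads P even when beta == 0.
            p[row + 3 * col] = alpha * acc + beta * p[row + 3 * col];
        }
}

// Example launch:
// batched_gemm3x3<<<(num_mat + 255) / 256, 256>>>(devM, devN, devP, alpha, beta, num_mat);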