Poor cuBLAS performance for extremely small matrix multiplications?


I’m trying to accelerate a code that performs many small (3×3) matrix multiplications. I compared the performance of serial CPU code, CPU OpenMP code, cuBLAS (strided batched GEMM), and OpenACC. cuBLAS performs worst: it is tens of times slower than the CPU OpenMP version and even slower than the serial CPU version.

For cuBLAS and OpenACC, I copy the data to the device only once and then run multiple iterations of the matrix multiplications. I used the PGI compiler for OpenACC and the Intel compiler for the others. My hardware is an Intel Xeon W-2145 CPU and a Quadro P2200 GPU.

Is it normal to see such low performance from cuBLAS for this type of calculation?

Can you provide a reproducer (code or a snippet)?

Hi, here’s the function for cuBLAS:

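#include <cassert>
#include <cuda_runtime.h>
#include <cublas_v2.h>

typedef float mytype;

// Computes P[b] = alpha * M[b] * N[b] + beta * P[b] for num_mat matrices, niter times.
// M is pr x mc, N is mc x pc, P is pr x pc; cuBLAS reads all of them as column-major.
void GPU_MultiStridedBatch(mytype *M, mytype *N, mytype *P, size_t pr, size_t pc, size_t mc, mytype alpha, mytype beta, int num_mat, int niter)
{
    mytype *devM, *devN, *devP;

    // Size in bytes of a single matrix of each kind.
    size_t p_size = sizeof(mytype) * pr * pc;
    size_t m_size = sizeof(mytype) * pr * mc;
    size_t n_size = sizeof(mytype) * mc * pc;

    // devP is never initialized on the device; that is only safe when beta == 0.
    cudaMalloc((void**)&devM, m_size * num_mat);
    cudaMalloc((void**)&devN, n_size * num_mat);
    cudaMalloc((void**)&devP, p_size * num_mat);

    // Copy the inputs to the device once, before the iteration loop.
    cudaMemcpy(devM, M, m_size * num_mat, cudaMemcpyHostToDevice);
    cudaMemcpy(devN, N, n_size * num_mat, cudaMemcpyHostToDevice);

    cublasHandle_t myhandle;
    cublasStatus_t cublas_result;

    cublas_result = cublasCreate(&myhandle);
    assert(cublas_result == CUBLAS_STATUS_SUCCESS);

    for (int i = 0; i < niter; i++) {
        cublas_result = cublasSgemmStridedBatched(myhandle, CUBLAS_OP_N, CUBLAS_OP_N
          , pr, pc, mc
          , &alpha, devM, pr, pr*mc, devN, mc, mc*pc
          , &beta, devP, pr, pr*pc
          , num_mat);
        assert(cublas_result == CUBLAS_STATUS_SUCCESS);
    }

    // This blocking copy also waits for the queued GEMM calls to finish.
    cudaMemcpy(P, devP, p_size * num_mat, cudaMemcpyDeviceToHost);

    // Release the handle and the device buffers.
    cublasDestroy(myhandle);
    cudaFree(devM);
    cudaFree(devN);
    cudaFree(devP);
}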
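
A note on data layout: the call above uses CUBLAS_OP_N with leading dimensions pr and mc, so cuBLAS reads every matrix as column-major. If the host arrays are row-major (as plain C arrays are), the result will only match the CPU code when the inputs happen to be symmetric or constant. One option is to repack the batch before the upload. The following is only a minimal sketch of that idea; `pack_column_major` is a hypothetical helper, not part of the original code or of cuBLAS.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the original post): converts num_mat
// row-major rows x cols matrices, stored back to back, into column-major
// matrices with the same per-matrix stride of rows * cols elements.
std::vector<float> pack_column_major(const float *src, std::size_t rows,
                                     std::size_t cols, std::size_t num_mat)
{
    std::vector<float> dst(rows * cols * num_mat);
    for (std::size_t b = 0; b < num_mat; ++b) {
        const float *in = src + b * rows * cols;   // start of matrix b (row-major)
        float *out = dst.data() + b * rows * cols; // start of matrix b (column-major)
        for (std::size_t r = 0; r < rows; ++r)
            for (std::size_t c = 0; c < cols; ++c)
                out[c * rows + r] = in[r * cols + c];
    }
    return dst;
}
```

The packed buffer could then be passed to cudaMemcpy in place of M or N. Since this only changes the layout, it should not be expected to change the timings by itself.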


And for the CPU version:

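#include <cstdio>
#include <omp.h>

// Assumed dimensions: 3x3 matrices, as stated in the question.
#define ROWM 3
#define COLM 3
#define COLN 3

// One M, N, P triple per element (array of structs, row-major storage).
struct element {
    mytype m[ROWM][COLM], n[COLM][COLN], p[ROWM][COLN];
};

int CPU_multi(int num_mat, int niter)
{
    element *elms = new element[num_mat];

    // Fill the inputs with constants and clear the outputs.
    for (int i = 0; i < num_mat; i++) {
        for (int j = 0; j < ROWM; j++)
            for (int k = 0; k < COLM; k++)
                elms[i].m[j][k] = 3.0f;
        for (int j = 0; j < COLM; j++)
            for (int k = 0; k < COLN; k++)
                elms[i].n[j][k] = 2.0f;
        for (int j = 0; j < ROWM; j++)
            for (int k = 0; k < COLN; k++)
                elms[i].p[j][k] = 0.0f;
    }

    // Time niter passes over all matrices: p = m * n for each element.
    double t1 = omp_get_wtime();
    for (int it = 0; it < niter; it++)
        for (int k = 0; k < num_mat; k++)
            for (int i = 0; i < ROWM; i++) {
                for (int j = 0; j < COLN; j++) {
                    elms[k].p[i][j] = 0.0f;
                    for (int m = 0; m < COLM; m++) {
                        elms[k].p[i][j] += elms[k].m[i][m] * elms[k].n[m][j];
                    }
                }
            }
    double t2 = omp_get_wtime();

    printf("CPU serial time : %e seconds \n", t2 - t1);
    delete[] elms;  // allocated with new[], so it must be released with delete[]
    return 0;
}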
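
The OpenMP version mentioned in the question is not shown in the thread. As a rough sketch only, assuming it parallelizes over the matrix index (each 3x3 product is independent of the others), it might look like the following. `Mat3Set` and `multiply_all` are hypothetical names that stand in for the `element` struct and the timed loop above.

```cpp
// Hypothetical stand-in for the `element` struct, fixed at 3x3.
struct Mat3Set {
    float m[3][3], n[3][3], p[3][3];
};

// Assumed OpenMP variant: one independent 3x3 product per loop iteration.
// Build with OpenMP enabled (for example -fopenmp, -qopenmp, or -mp,
// depending on the compiler); without it the pragma is ignored and the
// loop simply runs serially.
void multiply_all(Mat3Set *elms, int num_mat)
{
    #pragma omp parallel for schedule(static)
    for (int k = 0; k < num_mat; k++) {
        for (int i = 0; i < 3; i++) {
            for (int j = 0; j < 3; j++) {
                float acc = 0.0f;  // accumulate locally, store once
                for (int m = 0; m < 3; m++)
                    acc += elms[k].m[i][m] * elms[k].n[m][j];
                elms[k].p[i][j] = acc;
            }
        }
    }
}
```

Calling `multiply_all(elms, num_mat)` once per iteration between the two `omp_get_wtime()` calls would give a timing comparable to the serial loop.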