CuBLAS Showing Poor Performance

Hi, I’m trying to multiply a dense 16000x8000 matrix by an 8000x1 vector, based on the matmulCUBLAS example, and I’m getting fairly poor performance compared to the example code. All I changed was the height and width of the matrices to this:

matrix_size.uiWA = 8000;
matrix_size.uiHA = 16000;
matrix_size.uiWB = 1;
matrix_size.uiHB = 8000;
matrix_size.uiWC = 1;
matrix_size.uiHC = 16000;

and the cublas call to this:
cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, matrix_size.uiHA, matrix_size.uiWB, matrix_size.uiWA, &alpha, d_A, matrix_size.uiHA, d_B, matrix_size.uiHB, &beta, d_C, matrix_size.uiHC);

It’s only showing about 3 GFLOP/s of performance, while the 320x640 example code was doing about 1.4 TFLOP/s. Am I doing something wrong, or is a matrix this large just inherently slower?

That looks like a matrix-vector operation rather than a matrix-matrix operation, so I would suggest using SGEMV rather than SGEMM. Generally speaking, matrix-vector operations are bound by memory throughput, while matrix-matrix operations are compute-bound as long as the aspect ratio is not extreme.
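For reference, the SGEMM call above maps onto SGEMV roughly like this (a sketch reusing the same matrix_size fields and device pointers from the snippet above; error checking omitted):

```cuda
// Computes y = alpha * A * x + beta * y, replacing the m x n x 1 SGEMM.
// A is uiHA x uiWA in column-major order, x has uiWA elements,
// y has uiHA elements.
cublasSgemv(handle, CUBLAS_OP_N,
            matrix_size.uiHA,       // m: rows of A
            matrix_size.uiWA,       // n: cols of A
            &alpha,
            d_A, matrix_size.uiHA,  // lda = m for column-major storage
            d_B, 1,                 // x, with unit stride
            &beta,
            d_C, 1);                // y, with unit stride
```

Same alpha/beta and leading dimension as in the SGEMM call; only the "width 1" B and C matrices become strided vectors.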

Thanks. That sped it up from 8ms to 2ms, which I believe is as good as I’m going to get.

Generally speaking, over time computational throughput increases faster than memory throughput. As a consequence, a typical goal is to express higher-level matrix algorithms as BLAS3 (matrix-matrix) operations to the largest extent possible. There is extensive literature available, e.g., various ACM and SIAM publications.
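As an illustration of that principle (a hypothetical not taken from this thread): if you have several vectors to multiply by the same matrix, stacking them as columns of a matrix turns many SGEMV calls into one SGEMM, which amortizes the read of A across all right-hand sides:

```cuda
// Hypothetical sketch: d_X is uiWA x k (k vectors stacked as columns),
// d_Y is uiHA x k. One SGEMM replaces k SGEMV calls; the single pass
// over A makes the operation compute-bound instead of memory-bound.
int numVectors = 64;  // assumed number of stacked right-hand sides
cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
            matrix_size.uiHA,       // m: rows of A and Y
            numVectors,             // n: number of stacked vectors
            matrix_size.uiWA,       // k: inner dimension
            &alpha,
            d_A, matrix_size.uiHA,
            d_X, matrix_size.uiWA,  // stacked input vectors
            &beta,
            d_Y, matrix_size.uiHA); // stacked output vectors
```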

On a related note, I tried using cuSPARSE for this operation, because the matrix I’m multiplying with is very sparse (only 13320 non-zero elements in a 1480x1488 matrix). I expected really good results from cuSPARSE since the number of multiplies is small, but I’m getting 271us, which seems quite high for a GK110. I build the matrix in CSR format and am only timing the A*v part, where both A and v are complex. The cuSPARSE documentation is fairly thin, so I’m not sure whether I’m doing it correctly, beyond the fact that cusparseCcsrmv is returning 0 (no error). I tried timing with events like I normally do, but I was getting a negative answer, so I used Linux’s gettimeofday after synchronizing. Any idea what I’m doing wrong?
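For what it’s worth, a negative elapsed time from CUDA events usually means the start/stop events were passed to cudaEventElapsedTime in the wrong order, or the stop event was read before it completed. A minimal sketch of event-based timing, assuming everything is launched on the default stream (the cusparseCcsrmv arguments are left as-is, since they aren’t shown in the thread):

```cuda
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);               // record on the default stream, before the call
cusparseCcsrmv(/* ... your existing arguments ... */);
cudaEventRecord(stop, 0);                // record after the call
cudaEventSynchronize(stop);              // wait until stop has actually been reached

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // note the order: (start, stop), not (stop, start)

cudaEventDestroy(start);
cudaEventDestroy(stop);
```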

You may want to take a look at the sparse matrix-vector operations in ModernGPU: http://nvlabs.github.io/moderngpu/segreduce.html#spmv . There are some cases where it is faster than cuSPARSE.

I do not have first-hand experience with CUSPARSE, but based on previous discussions with the CUSPARSE team, it seems to me that a 1.5K x 1.5K matrix is at the very low end of the matrix sizes CUSPARSE is designed to handle.