CuBLAS Showing Poor Performance

Hi, I’m trying to multiply a dense 16000x8000 matrix by an 8000x1 vector, based on the matmulCUBLAS example, and I’m getting fairly poor performance compared to the example code. All I changed was the height and width of the matrices to this:

matrix_size.uiWA = 8000;
matrix_size.uiHA = 16000;
matrix_size.uiWB = 1;
matrix_size.uiHB = 8000;
matrix_size.uiWC = 1;
matrix_size.uiHC = 16000;

and the cublas call to this:
cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, matrix_size.uiHA, matrix_size.uiWB, matrix_size.uiWA, &alpha, d_A, matrix_size.uiHA, d_B, matrix_size.uiHB, &beta, d_C, matrix_size.uiHC);

It’s only showing about 3 GFLOP/s of performance, while the 320x640 example code was doing about 1.4 TFLOP/s. Am I doing something wrong, or is a matrix this large just inherently slower?

That looks like a matrix-vector operation rather than a matrix-matrix operation, so I would suggest using SGEMV rather than SGEMM. Generally speaking, matrix-vector operations are bound by memory throughput, while matrix-matrix operations are compute-bound as long as the aspect ratio is not extreme.
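For reference, the SGEMM call above maps onto SGEMV roughly like this (a sketch reusing the same matrix_size fields and device pointers from the snippet above; error checking omitted):

```cuda
// Computes y = alpha * A * x + beta * y, replacing the m x n x 1 SGEMM.
// A is uiHA x uiWA in column-major order, x has uiWA elements,
// y has uiHA elements.
cublasSgemv(handle, CUBLAS_OP_N,
            matrix_size.uiHA,       // m: rows of A
            matrix_size.uiWA,       // n: cols of A
            &alpha,
            d_A, matrix_size.uiHA,  // lda = m for column-major storage
            d_B, 1,                 // x, with unit stride
            &beta,
            d_C, 1);                // y, with unit stride
```

Same alpha/beta and leading dimension as in the SGEMM call; only the "width 1" B and C matrices become strided vectors.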

Thanks. That sped it up from 8ms to 2ms, which I believe is as good as I’m going to get.

Generally speaking, over time computational throughput increases faster than memory throughput. As a consequence, a typical goal is to express higher-level matrix algorithms as BLAS3 (matrix-matrix) operations to the largest extent possible. There is extensive literature available, e.g., various ACM and SIAM publications.
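As an illustration of that principle (a hypothetical not taken from this thread): if you have several vectors to multiply by the same matrix, stacking them as columns of a matrix turns many SGEMV calls into one SGEMM, which amortizes the read of A across all right-hand sides:

```cuda
// Hypothetical sketch: d_X is uiWA x k (k vectors stacked as columns),
// d_Y is uiHA x k. One SGEMM replaces k SGEMV calls; the single pass
// over A makes the operation compute-bound instead of memory-bound.
int numVectors = 64;  // assumed number of stacked right-hand sides
cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
            matrix_size.uiHA,       // m: rows of A and Y
            numVectors,             // n: number of stacked vectors
            matrix_size.uiWA,       // k: inner dimension
            &alpha,
            d_A, matrix_size.uiHA,
            d_X, matrix_size.uiWA,  // stacked input vectors
            &beta,
            d_Y, matrix_size.uiHA); // stacked output vectors
```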

On a related note, I tried using cuSPARSE for this operation, because the matrix I’m multiplying with is very sparse (only 13320 non-zero elements in a 1480x1488 matrix). I expected really good results from cuSPARSE since the number of multiplies is small, but I’m getting 271us, which seems quite high for a GK110. I build the matrix in CSR format and am only timing the A*v part, where both A and v are complex. The cuSPARSE documentation is fairly thin, so I’m not sure whether I’m doing it correctly, beyond the fact that cusparseCcsrmv is returning 0 (no error). I tried timing with events like I normally do, but I was getting a negative answer, so I used Linux’s gettimeofday after synchronizing. Any idea what I’m doing wrong?
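For what it’s worth, a negative elapsed time from CUDA events usually means the start/stop events were passed to cudaEventElapsedTime in the wrong order, or the stop event was read before it completed. A minimal sketch of event-based timing, assuming everything is launched on the default stream (the cusparseCcsrmv arguments are left as-is, since they aren’t shown in the thread):

```cuda
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);               // record on the default stream, before the call
cusparseCcsrmv(/* ... your existing arguments ... */);
cudaEventRecord(stop, 0);                // record after the call
cudaEventSynchronize(stop);              // wait until stop has actually been reached

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // note the order: (start, stop), not (stop, start)

cudaEventDestroy(start);
cudaEventDestroy(stop);
```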

You may want to take a look at the sparse matrix-vector operations in ModernGPU: http://nvlabs.github.io/moderngpu/segreduce.html#spmv . There are some cases where it is faster than cuSPARSE.

I do not have first-hand experience with CUSPARSE, but based on previous discussions with the CUSPARSE team, it seems to me that a 1.5K x 1.5K matrix is at the very low end of the matrix sizes CUSPARSE is designed to handle.