cuBLAS sgemm is slow

Hello, I’m using cuBLAS in my deep learning application and facing a performance issue.

In my application, I do a matrix multiplication for the forward pass of a neural network,
MatrixA(2x238800) x MatrixB(238800x32), but it is very slow.
The sgemm invocation is like:
const float one = 1.0f;
// y(2x32) = 1.0 * W(2x238800) * x(238800x32) + 1.0 * y
cublasSgemm(cublasHandle, CUBLAS_OP_N, CUBLAS_OP_N,
            2, 32, 238800, &one, W, 2, x, 238800,
            &one, y, 2);
nvvp indicates this is because cuBLAS launches the kernel with a very small grid size.

On the other hand, the backward pass is fast, where we do the multiplication
MatrixA(238800x2) x MatrixB(2x32).
The invocation is like
const float one = 1.0f;
// inGrad(238800x32) = W^T(238800x2) * outGrad(2x32); zero is assumed to point to 0.0f
cublasSgemm(cublasHandle, CUBLAS_OP_T, CUBLAS_OP_N, 238800, 32, 2,
            &one, W, 2, outGrad, 2, zero, inGrad, 238800);

W is a 2x238800 matrix.
In the first case, the sgemm takes around 50ms, but it takes less than 1ms in the second case.

Can anyone suggest why the performance differs so much, and how to improve the performance of the first case?

The performance of parallelized GEMM is typically best when the matrices are square(-ish). Here we have extreme aspect ratios for the matrices, making this case closer to a matrix-vector computation; performance anomalies are to be expected.
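
Note that both multiplications perform the same amount of arithmetic (2 x 2 x 32 x 238800 ≈ 30.6 MFLOP), so the roughly 50x difference in runtime reflects how well that work can be spread across the GPU rather than a difference in the amount of work.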

A classical parallelization technique for GEMM is to use one thread to produce each element of the result matrix. Here we have matrixC (2x32) in the first case, and matrixC (238800x32) in the second case, with the latter providing much more “natural” parallelism.
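
To make this concrete, a naive kernel built along these lines might look roughly like the sketch below (purely illustrative, not cuBLAS's actual implementation; the kernel name and layout are my own, assuming column-major storage as cuBLAS uses). With one thread per element of C, the forward case can use at most 2 x 32 = 64 threads no matter how large k is.

__global__ void naive_gemm(int m, int n, int k,
                           const float *A, const float *B, float *C)
{
    // One thread per element of C = A * B (beta = 0 for simplicity).
    // A is m x k, B is k x n, C is m x n, all column-major.
    int row = blockIdx.x * blockDim.x + threadIdx.x;   // 0 .. m-1
    int col = blockIdx.y * blockDim.y + threadIdx.y;   // 0 .. n-1
    if (row < m && col < n) {
        float sum = 0.0f;
        for (int i = 0; i < k; ++i) {
            sum += A[i * m + row] * B[col * k + i];
        }
        C[col * m + row] = sum;
    }
}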

Even if CUBLAS treats the first case a little bit more intelligently by using multiple threads to compute one result matrix element and using a reduction at the end, the actual parallelism will still tend to be significantly less than in the second case. The small grid size is an expression of that.
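
A simplified sketch of such a scheme (again purely illustrative, with a made-up kernel name, not what cuBLAS actually does) could assign one thread block per element of C, let each thread sum a strided slice of the k dimension, and combine the partial sums with a shared-memory reduction:

__global__ void splitk_gemm(int m, int n, int k,
                            const float *A, const float *B, float *C)
{
    // One block per element of C; blockDim.x must be a power of two here.
    extern __shared__ float partial[];
    int row = blockIdx.x;   // 0 .. m-1
    int col = blockIdx.y;   // 0 .. n-1
    float sum = 0.0f;
    for (int i = threadIdx.x; i < k; i += blockDim.x) {
        sum += A[i * m + row] * B[col * k + i];   // column-major access
    }
    partial[threadIdx.x] = sum;
    __syncthreads();
    // Tree reduction of the per-thread partial sums.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) C[col * m + row] = partial[0];
}

// launched e.g. as: splitk_gemm<<<dim3(m, n), 256, 256 * sizeof(float)>>>(m, n, k, A, B, C);

Even then, the first case offers only 2 x 32 = 64 blocks to distribute across the GPU's multiprocessors, so the available parallelism remains modest.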

Does nvvp show the exact grid dimension and thread-block sizes? Computing the total number of threads for the two cases will likely be instructive.

Thank you for your comment, njuffa.
I understand that there is no easy way to improve the performance.

I checked the grid/block sizes with nvvp.
In the first (slow) case, grid size = [1,1,1], block size = [8,8,1].
In the second (fast) case, grid size = [3732,1,1], block size = [8,16,1].
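
That works out to only 1 x 8 x 8 = 64 threads in total for the slow case, versus 3732 x 8 x 16 = 477,696 threads for the fast case.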

Your comment explains this result.
Thank you again for your kind help.

In practical terms, if you aren’t using CUDA 8.0 yet, you would want to try that, as newer versions are likely to be more highly optimized than older versions.

While this case may be difficult to accelerate, it might still make sense to file an enhancement request with NVIDIA, especially if there is reason to believe that cases like this are common in deep learning. As you probably are aware, NVIDIA is pursuing the deep learning market with vigor, so they are more likely to be receptive to enhancement requests arising in that context.

Enhancement requests can be filed via the bug reporting form linked directly from the CUDA registered developer website. Simply prefix the bug synopsis with “RFE:” to mark it as an enhancement request.

I’m using CUDA 8. When they update CUDA, I will try it and profile again.
I sent a report about this with the prefix “RFE:” and it was filed as #1950006.

Thank you for your suggestion.