Hello, I’m using cuBLAS in my deep learning application and facing a performance issue.

In my application, I do a matrix multiplication for the forward pass of a neural network, MatrixA(2x238800) x MatrixB(238800x32), but it is very slow.

The sgemm invocation is like this (with const float one = 1.0f; the cuBLAS v2 API takes alpha/beta by pointer, not by value):

cublasSgemm(cublasHandle, CUBLAS_OP_N, CUBLAS_OP_N,
            2, 32, 238800,
            &one, W, 2, x, 238800,
            &one, y, 2);

nvvp indicates that cuBLAS launches the kernel with too small a grid, so the GPU is underutilized.

On the other hand, the backward pass is fast, where we do the multiplication MatrixA(238800x2) x MatrixB(2x32).

The invocation is like this:

cublasSgemm(cublasHandle, CUBLAS_OP_T, CUBLAS_OP_N,
            238800, 32, 2,
            &one, W, 2, outGrad, 2,
            zero, inGrad, 238800);

W is a 2x238800 matrix.

In the first case, the sgemm takes around 50ms, but it takes less than 1ms in the second case.

Can anyone suggest why the performance differs so much between the two cases, and how to improve the performance of the first one?