Hello, I’m using cuBLAS in my deep learning application and facing a performance issue.
In my application, I do a matrix multiplication for the forward pass of a neural network,
MatrixA (2x238800) x MatrixB (238800x32), but it is very slow.
The sgemm invocation looks like this (alpha and beta are pointers to 1.0f, as the v2 API requires):

    cublasSgemm(cublasHandle, CUBLAS_OP_N, CUBLAS_OP_N,
                2, 32, 238800,
                &alpha, W, 2, x, 238800,
                &beta, y, 2);
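To make the shapes concrete, here is a NumPy sketch of the math I intend (a shape check only, not the GPU path; variable names mirror my code, and cuBLAS stores these column-major, but the dimensions are the same):

```python
import numpy as np

K = 238800  # reduction dimension passed to sgemm

# W and x mirror the arguments of the sgemm call above.
W = np.random.rand(2, K).astype(np.float32)   # weights, 2 x K
x = np.random.rand(K, 32).astype(np.float32)  # input batch, K x 32

y = W @ x  # output is only 2 x 32, but every element reduces over K = 238800
print(y.shape)  # (2, 32)
```

Note how small the output is: only 2 x 32 = 64 elements, each with a very long reduction.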
nvvp indicates the problem is that cuBLAS launches the kernel with a very small grid.
On the other hand, the backward pass is fast, where we do the multiplication
MatrixA (238800x2) x MatrixB (2x32).
That invocation looks like this (zero points to 0.0f):

    cublasSgemm(cublasHandle, CUBLAS_OP_T, CUBLAS_OP_N,
                238800, 32, 2,
                &alpha, W, 2, outGrad, 2,
                &zero, inGrad, 238800);
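The same kind of NumPy sketch for the backward call (CUBLAS_OP_T on W means the GPU effectively computes W^T @ outGrad; again a shape check only, with names mirroring my code):

```python
import numpy as np

K = 238800
W = np.random.rand(2, K).astype(np.float32)         # same W as in the forward pass
outGrad = np.random.rand(2, 32).astype(np.float32)  # gradient w.r.t. y

inGrad = W.T @ outGrad  # 238800 x 32: millions of output elements to parallelize over
print(inGrad.shape)  # (238800, 32)
```

Here the output matrix is huge, so there are plenty of output tiles to spread across the GPU, which may be why this case fills a much larger grid.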
W is a 2x238800 matrix.
In the first case, the sgemm takes around 50 ms, but in the second case it takes less than 1 ms.
Can anyone suggest why the performance is so different, and how to speed up the first case?