I use cuSPARSE and cuBLAS to compute a sparse-dense multiplication: C = A' * B.
A is an M*N sparse matrix
B is an M*S dense matrix
M = 9,633,792, N = 617,004, nnz = 28,901,376, S = 3
I have tried different methods to make it faster:

A is stored in CSR format; I use cusparseScsrmm with the transpose operation on A to compute A'*B. It takes 180 ms.

A' = At is stored in CSR format; I use cusparseScsrmm2 to compute At*(B')', transposing B first to improve the memory access pattern on B. According to the documentation, if op(B) = B^T then only op(A) = A is supported, so I stored At in CSR form in advance. It takes 8 ms to transpose B and 4 ms to compute At*(B')', 12 ms altogether.

A' = At is stored in CSR format; I use cusparseScsrmm with no transpose to compute At*B = A'*B. It takes 8 ms.
A is constant across iterations, so time spent preprocessing A does not matter, but any per-iteration work on B must be counted. More specifically, A is a binary matrix with exactly 3 nonzero values in every row.
So I'm wondering: is there any method that could speed this up further? 4 ms would be acceptable. For example, improving the memory access pattern on matrix B without the cost of transposing it. I also considered using constant memory to store A, but CUDA seems to have only 64 KB of constant memory; or using texture memory to store B, but texture memory is read-only, so it may not be suitable.