Method to speed up cuSPARSE sparse-dense multiplication

I use cuSPARSE and cuBLAS to compute a sparse-dense multiplication: C = A’ * B.

A is an M*N sparse matrix

B is a M*S dense matrix

M = 9,633,792, N = 617,004, nnz is 28,901,376, S = 3

I have tried different methods to make it faster:

1. A is stored in CSR format, and I use cusparseScsrmm to compute A’*B. It takes 180 ms.

2. A’ = At is stored in CSR format, and I use cusparseScsrmm2 to compute At*(B’)’. Here B is transposed to improve memory access to matrix B. According to the documentation, if op(B) = B^T, only op(A) = A is supported, so I store At in CSR format in advance. It takes 8 ms to transpose B and 4 ms to compute At*(B’)’, 12 ms altogether.

3. A’ = At is stored in CSR format, and I use cusparseScsrmm to compute A’*B. It takes 8 ms.
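To make the layout change in method 2 concrete: the csrmm routines treat dense matrices as column-major, so the S = 3 values belonging to one row of B sit M elements apart in memory. Transposing B puts them next to each other. Below is a minimal host-side sketch of that reindexing. The function name `to_row_major` is made up for illustration and is not part of cuSPARSE or cuBLAS; on the GPU the same reindexing would be done by a transpose kernel or a library call.

```cpp
#include <cstddef>
#include <vector>

// Illustrative helper (hypothetical name): convert B from column-major
// (M x S, leading dimension M) to row-major, which is the same memory
// image as column-major B^T with leading dimension S.
std::vector<float> to_row_major(const std::vector<float>& b_col_major,
                                std::size_t M, std::size_t S) {
    std::vector<float> b_row_major(M * S);
    for (std::size_t i = 0; i < M; ++i) {
        for (std::size_t s = 0; s < S; ++s) {
            // Column-major element (i, s) lives at s * M + i;
            // row-major element (i, s) lives at i * S + s.
            b_row_major[i * S + s] = b_col_major[s * M + i];
        }
    }
    return b_row_major;
}
```

After this step, the three values of row i of B are contiguous at offsets 3*i, 3*i + 1 and 3*i + 2, which is the access pattern method 2 relies on.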

A is constant across iterations, so the time spent preprocessing A does not need to be counted, but the time spent operating on B does. More specifically, A is a binary matrix with 3 nonzero values in every row.
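The numbers above are consistent with this structure: 3 * 9,633,792 = 28,901,376 = nnz. Because every nonzero equals 1 and every row has exactly 3 of them, A is fully described by 3 column indices per row, with no value array. The following is a minimal CPU reference sketch of C = A’ * B under that assumption, meant only for checking GPU results on small inputs. The name `binary_at_times_b`, the index layout and the row-major storage of B and C are illustrative choices, not anything taken from cuSPARSE.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kNnzPerRow = 3;  // from the post: 3 nonzeros per row of A
constexpr std::size_t kS = 3;          // from the post: S = 3

// cols[i] holds the column indices of the ones in row i of A
// (hypothetical layout; all nonzero values are implicitly 1).
// b is M x S row-major; the result is C = A^T * B, N x S row-major.
std::vector<float> binary_at_times_b(
    const std::vector<std::array<std::int32_t, kNnzPerRow>>& cols,
    const std::vector<float>& b, std::size_t N) {
    std::vector<float> c(N * kS, 0.0f);
    for (std::size_t i = 0; i < cols.size(); ++i) {
        for (std::int32_t j : cols[i]) {
            const std::size_t row = static_cast<std::size_t>(j);
            // A(i, j) == 1, so row i of B is added into row j of C.
            for (std::size_t s = 0; s < kS; ++s) {
                c[row * kS + s] += b[i * kS + s];
            }
        }
    }
    return c;
}
```

For example, with M = 2, N = 4, index rows {0, 1, 2} and {1, 2, 3}, and B rows (1, 2, 3) and (10, 20, 30), rows 1 and 2 of C each receive the sum of both rows of B.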

So I’m wondering whether there is any method that could speed this up. 4 ms may be acceptable. For example, something that improves memory access to matrix B without itself being time-consuming. I also considered storing A in constant memory, but CUDA seems to have only 64 KB of constant memory, or storing B in texture memory, but that is read-only memory, so it may not be suitable.