I’ve been working with some sparse matrices recently, and while profiling my code I noticed that my call to csrmv() with the TRANSPOSE operation takes about twice as long as the same call with NON_TRANSPOSE. One way to reduce that would be to store the matrix in both CSR and CSC format and use the CSC data with NON_TRANSPOSE (as the documentation suggests), but the matrices already take a lot of memory. Is there another way to speed up the TRANSPOSE case using some “smart” trick?
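For context, here is a small SciPy illustration (not cuSPARSE itself) of why the documentation’s suggestion works: the three CSR arrays of a matrix, reinterpreted as CSC arrays, describe the transpose, so a NON_TRANSPOSE product over CSC-of-A is the same computation as TRANSPOSE over CSR-of-A. The `csc_matrix` reinterpretation below is my own sketch, not cuSPARSE code:

```python
import numpy as np
from scipy.sparse import csr_matrix, csc_matrix

rng = np.random.default_rng(0)
# A 50x40 sparse matrix stored in CSR format (~10% density)
dense = rng.random((50, 40)) * (rng.random((50, 40)) < 0.1)
A = csr_matrix(dense)

# Reinterpret the very same (data, indices, indptr) arrays as CSC:
# this is A's transpose, with zero extra storage for the values.
At = csc_matrix((A.data, A.indices, A.indptr),
                shape=(A.shape[1], A.shape[0]))

x = rng.random(50)
y_transpose = A.T @ x   # what csrmv(TRANSPOSE) computes
y_reinterp  = At @ x    # same arrays, read as CSC, no transpose needed
assert np.allclose(y_transpose, y_reinterp)
```

The catch, of course, is that this only avoids the slow transpose path if the library accepts the CSC layout directly; keeping both layouts resident is what costs the extra memory.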
I also have another question: is there a way to measure how much extra memory csrmv() uses during the transpose operation? The documentation only says it requires “extra storage”. Right now I query memory usage with cudaMemGetInfo(), but that only gives me a snapshot before or after the call, so I can’t see what happens while csrmv() is actually running.
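One approach I’m considering is to poll free memory from a second thread while the operation runs and keep the minimum observed. Here is a sketch of that idea in Python, with a hypothetical `get_free_bytes()` standing in for cudaMemGetInfo() (both the helper and the stand-in are my own names, not a real API):

```python
import threading
import time

def sample_peak_usage(op, get_free_bytes, interval_s=0.001):
    """Run op() while polling get_free_bytes() from a second thread.

    Returns the minimum free memory observed, i.e. the point of
    peak usage during op(). get_free_bytes is a stand-in for a
    call like cudaMemGetInfo().
    """
    samples = []
    stop = threading.Event()

    def poll():
        while not stop.is_set():
            samples.append(get_free_bytes())
            time.sleep(interval_s)

    t = threading.Thread(target=poll)
    t.start()
    try:
        op()
    finally:
        stop.set()
        t.join()
    return min(samples) if samples else None
```

Two caveats I can see with this on the GPU: the kernel launch is asynchronous, so the sampling thread has to overlap with the stream actually executing the work, and the library’s internal buffer might be allocated at call time rather than during the kernel, so the sampling interval has to be fine enough to catch it.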