Hello, I have a performance problem with cusparseDcsrmv on a symmetric matrix.

cusparseDcsrmv(handle, cusparseOperation.CUSPARSE_OPERATION_NON_TRANSPOSE, matrixSize, matrixSize, 1,
descra, d_csrValA, d_rowPtrA, d_colIndA, d_x, 0, d_y);
If I use cusparseSetMatType(descra, CUSPARSE_MATRIX_TYPE_GENERAL); the call runs 10 times faster than when I use cusparseSetMatType(descra, cusparseMatrixType.CUSPARSE_MATRIX_TYPE_SYMMETRIC);

The library team provided the following answer (I slightly edited the last paragraph). I hope this helps:

The algorithms used to perform sparse matrix-vector multiplication for symmetric and nonsymmetric matrices are different. For symmetric matrices only the upper or lower triangular part of the matrix is stored in memory, so we first perform the multiplication with the stored part (in the same way as the regular sparse matrix-vector multiplication) and then perform a multiplication with the transpose of the stored part (ignoring the diagonal). Here is a more precise example that assumes the lower part of the matrix is stored in memory.

Let A = L + D + U, where D is the diagonal, L is the strictly lower triangular part and U is the strictly upper triangular part of the matrix (notice that for symmetric matrices U = L^{T}). So we perform the regular multiplication with L+D and then add the multiplication with U = L^{T} to the result: y = A*x = (L+D)*x + L^{T}*x.

The multiplication with the transpose of the matrix involves the use of atomics, which makes the algorithm relatively slow (so the slowdown the user observes is not surprising). To achieve the highest possible performance, it would probably be best to store the full matrix in memory (even though it is symmetric) and then call the regular sparse matrix-vector multiplication on it.

I checked with the CUDA library team, and they supplied the following explanation:

For the nonsymmetric sparse matrix-vector multiplication the operation y = A*x is performed (A is stored explicitly).

For the symmetric matrix, only the lower (or upper) triangular part of A is stored. We can write y = A*x = (L+D)*x + L^{T}*x, where A = (L+D) + L^{T}, with L being the strictly lower triangular part of the matrix and D being the diagonal. Since only L+D is stored, we need to perform an operation with the matrix transpose (L^{T}) to compute the resulting vector y. This operation uses atomics because matrix rows need to be interpreted as columns, and as multiple threads traverse them, different threads might add values to the same memory location in the resulting vector y. This is why matrix-vector multiplication with the matrix transpose, and therefore with a symmetric matrix, is slower than with a nonsymmetric matrix.
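
To make the write conflict concrete, here is a minimal serial sketch in C of y = (L+D)*x + L^{T}*x over a lower-triangular CSR matrix. It is not the cuSPARSE implementation: the function name sym_lower_csr_spmv and its parameter names are hypothetical, and zero-based indexing is assumed. The first update per entry only touches y[i] for the current row, while the second scatters into y[j], which is the spot where a kernel running one thread (or warp) per row would need an atomic add.

```c
#include <assert.h>

/* Hypothetical serial reference, not a cuSPARSE routine.
 * Computes y = (L+D)*x + L^T*x, where only the lower triangle (L+D)
 * of a symmetric n x n matrix is stored in zero-based CSR form. */
void sym_lower_csr_spmv(int n, const int *row_ptr, const int *col_ind,
                        const double *val, const double *x, double *y)
{
    assert(n >= 0);

    for (int i = 0; i < n; ++i)
        y[i] = 0.0;

    for (int i = 0; i < n; ++i) {
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k) {
            int j = col_ind[k];

            /* (L+D)*x: a gather. Only row i ever writes y[i]. */
            y[i] += val[k] * x[j];

            /* L^T*x: a scatter (diagonal skipped). Several rows i can
             * hit the same y[j], so a row-parallel kernel would need an
             * atomic add here. */
            if (j != i)
                y[j] += val[k] * x[i];
        }
    }
}
```

In this serial form the scatter is harmless. The cost only appears once rows are processed concurrently, which matches the explanation above.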

The best way to speed up the computation (unless you are limited by memory) would be to expand the symmetric matrix into full, nonsymmetric storage, with both triangles stored explicitly, and call the appropriate CUSPARSE routine on it.
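
One possible way to do that expansion on the host, before copying the arrays to the device, is sketched below in C. This is an illustration under assumptions, not a library routine: expand_lower_csr and its parameters are hypothetical names, indices are zero-based, the input holds the lower triangle including the diagonal, and the caller provides output arrays with room for n+1 row pointers and up to 2*nnz entries.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper, not part of cuSPARSE.
 * Expands a lower-triangular CSR matrix (L+D) into the full CSR form of
 * A = L + D + L^T. Returns the number of stored entries in the result,
 * or -1 if the temporary allocation fails. */
int expand_lower_csr(int n, const int *row_ptr, const int *col_ind,
                     const double *val,
                     int *full_row_ptr, int *full_col_ind, double *full_val)
{
    assert(n >= 0);

    /* Pass 1: count entries per row of the full matrix. */
    for (int i = 0; i <= n; ++i)
        full_row_ptr[i] = 0;
    for (int i = 0; i < n; ++i) {
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k) {
            int j = col_ind[k];
            full_row_ptr[i + 1]++;          /* entry (i, j)          */
            if (j != i)
                full_row_ptr[j + 1]++;      /* mirrored entry (j, i) */
        }
    }

    /* Prefix sum turns the counts into row offsets. */
    for (int i = 0; i < n; ++i)
        full_row_ptr[i + 1] += full_row_ptr[i];

    /* Pass 2: fill, tracking the next free slot in every row. */
    int *next = malloc((size_t)(n + 1) * sizeof *next);
    if (next == NULL)
        return -1;
    for (int i = 0; i < n; ++i)
        next[i] = full_row_ptr[i];

    for (int i = 0; i < n; ++i) {
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k) {
            int j = col_ind[k];
            int p = next[i]++;
            full_col_ind[p] = j;
            full_val[p] = val[k];
            if (j != i) {
                int q = next[j]++;
                full_col_ind[q] = i;
                full_val[q] = val[k];
            }
        }
    }

    free(next);
    return full_row_ptr[n];
}
```

The expanded arrays would then be passed to cusparseDcsrmv with the descriptor set to CUSPARSE_MATRIX_TYPE_GENERAL, as in the faster case from the question. The trade-off is memory: every off-diagonal entry is stored twice.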