Hello, I have a problem with cusparseDcsrmv and a symmetric matrix. It runs about 10 times faster when I use cusparseSetMatType(descra, CUSPARSE_MATRIX_TYPE_GENERAL) than when I use cusparseSetMatType(descra, CUSPARSE_MATRIX_TYPE_SYMMETRIC). I got an answer in my last topic: “The multiplication with the transpose of the matrix involves the use of atomics, which makes the algorithm relatively slow (consequently the user experience is not surprising). To achieve the highest possible performance, it would probably be best to store the full matrix in memory (even though it is symmetric) and then call the regular sparse-matrix-vector multiplication on it.” But in my opinion it is possible to write an algorithm without transposing the matrix:

public void mult() {
    // rowPtrA/colIndA/csrValA hold the upper triangle of the symmetric
    // matrix in CSR format; each off-diagonal entry is mirrored into the
    // lower triangle on the fly, so no transpose is ever formed.
    for (int i = 0; i < rowPtrA.length - 1; i++) {
        int origin = rowPtrA[i];
        int length = rowPtrA[i + 1] - origin;
        double vi = vector[i];
        for (int j = 0; j < length; j++) {
            int col = colIndA[origin + j];
            double val = csrValA[origin + j];
            result[i] += val * vector[col];
            // Off-diagonal entries also contribute to the mirrored row.
            if (col != i) result[col] += val * vi;
        }
    }
}

When I multiply an element of the symmetric matrix by the appropriate element of the vector, I also take the mirrored element in the other triangle of the matrix and multiply it by its appropriate vector element. So I never compute the matrix transpose.
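To illustrate, here is a self-contained version of the same idea (the matrix values and class name are my own small example, not from my real code), checked against the dense product:

```java
import java.util.Arrays;

// Minimal sketch: symmetric SpMV that stores only the upper triangle.
// Example 3x3 symmetric matrix:
//   [ 4 1 2 ]
//   [ 1 5 0 ]
//   [ 2 0 6 ]
// Upper triangle in CSR, diagonal first in each row.
public class SymCsrMv {
    static int[]    rowPtrA = { 0, 3, 4, 5 };
    static int[]    colIndA = { 0, 1, 2, 1, 2 };
    static double[] csrValA = { 4, 1, 2, 5, 6 };
    static double[] vector  = { 1, 2, 3 };
    static double[] result  = new double[3];

    static void mult() {
        for (int i = 0; i < rowPtrA.length - 1; i++) {
            int origin = rowPtrA[i];
            int length = rowPtrA[i + 1] - origin;
            double vi = vector[i];
            for (int j = 0; j < length; j++) {
                int col = colIndA[origin + j];
                double val = csrValA[origin + j];
                result[i] += val * vector[col];
                // Mirror each off-diagonal entry into the lower triangle.
                if (col != i) result[col] += val * vi;
            }
        }
    }

    public static void main(String[] args) {
        mult();
        // Dense check: A*x = { 4+2+6, 1+10, 2+18 } = { 12, 11, 20 }
        System.out.println(Arrays.toString(result));
    }
}
```

On the CPU this gives exactly the full A*x, touching each stored entry once; the scattered update to result[col] is what would need atomics in a parallel (GPU) version.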

I think this is a more efficient and faster algorithm than Ax = (L+D+U)x = (L+D)x + (L^T)x. Why don't you use this algorithm for working with symmetric matrices?