cuSPARSE generic SpSM much slower than legacy csrsm2

Hi,

In my application, the sparse triangular solve with a right-hand-side matrix (spsm, sparse trsm) is the most important kernel. I found that the new generic cuSPARSE interface (SpSM, link) is much slower than the now-deprecated legacy API (csrsm2, link to 11.7.1 version).

In the following zip, you can find the example program source code, a Makefile, and four different matrices used in my use case. I think the code should be clear to people who know cuSPARSE, so it needs no explanation. Just compile and run using make run.

cusparse_trsm_comparison.zip (55.3 MB) (unzips to ~160MB)
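For reference, the generic-API call sequence I am timing looks roughly like this. This is a sketch, not the exact code from the zip: double precision, 32-bit CSR indices, column-major RHS, and error checking omitted; `n`, `nrhs`, `nnz` and the device pointers are placeholders.

```cuda
#include <cuda_runtime.h>
#include <cusparse.h>
#include <cstdint>

// Solve L * X = B with the generic SpSM API (sketch, no error checking).
void spsm_generic(cusparseHandle_t handle,
                  int n, int nrhs, int64_t nnz,
                  int *d_csrRowPtr, int *d_csrColInd, double *d_csrVal,
                  double *d_B, double *d_X)
{
    double alpha = 1.0;
    cusparseSpMatDescr_t matL;
    cusparseDnMatDescr_t matB, matX;
    cusparseSpSMDescr_t  spsmDescr;

    cusparseCreateCsr(&matL, n, n, nnz, d_csrRowPtr, d_csrColInd, d_csrVal,
                      CUSPARSE_INDEX_32I, CUSPARSE_INDEX_32I,
                      CUSPARSE_INDEX_BASE_ZERO, CUDA_R_64F);
    // Declare L as lower triangular with a non-unit diagonal
    cusparseFillMode_t fill = CUSPARSE_FILL_MODE_LOWER;
    cusparseDiagType_t diag = CUSPARSE_DIAG_TYPE_NON_UNIT;
    cusparseSpMatSetAttribute(matL, CUSPARSE_SPMAT_FILL_MODE, &fill, sizeof(fill));
    cusparseSpMatSetAttribute(matL, CUSPARSE_SPMAT_DIAG_TYPE, &diag, sizeof(diag));

    cusparseCreateDnMat(&matB, n, nrhs, n, d_B, CUDA_R_64F, CUSPARSE_ORDER_COL);
    cusparseCreateDnMat(&matX, n, nrhs, n, d_X, CUDA_R_64F, CUSPARSE_ORDER_COL);

    cusparseSpSM_createDescr(&spsmDescr);

    size_t bufferSize = 0;
    void  *d_buffer   = nullptr;
    cusparseSpSM_bufferSize(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                            CUSPARSE_OPERATION_NON_TRANSPOSE, &alpha,
                            matL, matB, matX, CUDA_R_64F,
                            CUSPARSE_SPSM_ALG_DEFAULT, spsmDescr, &bufferSize);
    cudaMalloc(&d_buffer, bufferSize);

    cusparseSpSM_analysis(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                          CUSPARSE_OPERATION_NON_TRANSPOSE, &alpha,
                          matL, matB, matX, CUDA_R_64F,
                          CUSPARSE_SPSM_ALG_DEFAULT, spsmDescr, d_buffer);

    // Only this call is timed; analysis is done once up front.
    // Note: solve takes no buffer argument -- it reuses d_buffer internally,
    // which must stay allocated and unmodified since the analysis call.
    cusparseSpSM_solve(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                       CUSPARSE_OPERATION_NON_TRANSPOSE, &alpha,
                       matL, matB, matX, CUDA_R_64F,
                       CUSPARSE_SPSM_ALG_DEFAULT, spsmDescr);

    cudaFree(d_buffer);
    cusparseSpSM_destroyDescr(spsmDescr);
    cusparseDestroyDnMat(matX);
    cusparseDestroyDnMat(matB);
    cusparseDestroySpMat(matL);
}
```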

On the A100-40GB machine I am using, with CUDA/11.7.0, this is the output I am observing:

System matrix13.txt, size=2744, nrhs=984, nnz=249065:
  Legacy:      1.294336 ms
  Generic:     7.670579 ms

System matrix16.txt, size=4913, nrhs=1450, nnz=593851:
  Legacy:      3.538227 ms
  Generic:    19.877888 ms

System matrix20.txt, size=9261, nrhs=2210, nnz=1468975:
  Legacy:     12.004761 ms
  Generic:    86.814720 ms

System matrix25.txt, size=17576, nrhs=3386, nnz=3605929:
  Legacy:     42.664242 ms
  Generic:   273.142517 ms

I really did not expect the performance to degrade with a newer cuSPARSE API, and definitely not by this much.

The performance of SpSM is so poor for my use case that it is faster to convert the sparse system matrix to dense and use the dense trsm from cuBLAS.

Could you please look into this? Why did the performance degrade so much? My guess is that it uses a different algorithm (I use algo=1 in the legacy version); if that's the case, why is that algorithm not available in the generic version?
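For comparison, the legacy call sequence I benchmark against, with the algo parameter the generic API no longer exposes, looks roughly like this (again a sketch with placeholder sizes and pointers, error checking omitted; note that csrsm2 solves in place, overwriting B):

```cuda
#include <cuda_runtime.h>
#include <cusparse.h>

// Solve L * X = B with the legacy csrsm2 API, algo = 1 (sketch).
void spsm_legacy(cusparseHandle_t handle,
                 int m, int nrhs, int nnz,
                 const double *d_csrVal, const int *d_csrRowPtr,
                 const int *d_csrColInd,
                 double *d_B /* in/out, leading dimension m */)
{
    const double alpha = 1.0;
    const int    algo  = 1;  // the algorithm that performs well for me

    cusparseMatDescr_t descrL;
    cusparseCreateMatDescr(&descrL);
    cusparseSetMatIndexBase(descrL, CUSPARSE_INDEX_BASE_ZERO);
    cusparseSetMatFillMode(descrL, CUSPARSE_FILL_MODE_LOWER);
    cusparseSetMatDiagType(descrL, CUSPARSE_DIAG_TYPE_NON_UNIT);

    csrsm2Info_t info;
    cusparseCreateCsrsm2Info(&info);

    size_t bufferSize = 0;
    cusparseDcsrsm2_bufferSizeExt(handle, algo,
                                  CUSPARSE_OPERATION_NON_TRANSPOSE,
                                  CUSPARSE_OPERATION_NON_TRANSPOSE,
                                  m, nrhs, nnz, &alpha, descrL,
                                  d_csrVal, d_csrRowPtr, d_csrColInd,
                                  d_B, m, info,
                                  CUSPARSE_SOLVE_POLICY_NO_LEVEL, &bufferSize);
    void *d_buffer = nullptr;
    cudaMalloc(&d_buffer, bufferSize);

    cusparseDcsrsm2_analysis(handle, algo,
                             CUSPARSE_OPERATION_NON_TRANSPOSE,
                             CUSPARSE_OPERATION_NON_TRANSPOSE,
                             m, nrhs, nnz, &alpha, descrL,
                             d_csrVal, d_csrRowPtr, d_csrColInd,
                             d_B, m, info,
                             CUSPARSE_SOLVE_POLICY_NO_LEVEL, d_buffer);

    // Only this call is timed; B is overwritten with the solution X.
    cusparseDcsrsm2_solve(handle, algo,
                          CUSPARSE_OPERATION_NON_TRANSPOSE,
                          CUSPARSE_OPERATION_NON_TRANSPOSE,
                          m, nrhs, nnz, &alpha, descrL,
                          d_csrVal, d_csrRowPtr, d_csrColInd,
                          d_B, m, info,
                          CUSPARSE_SOLVE_POLICY_NO_LEVEL, d_buffer);

    cudaFree(d_buffer);
    cusparseDestroyCsrsm2Info(info);
    cusparseDestroyMatDescr(descrL);
}
```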

Thanks,

Jakub


A little background on the matrices. I use this in a FETI solver. There we have a FEM-like sparse stiffness matrix K, which represents a 3D mesh: each row and column corresponds to a single degree of freedom (~= a node in the mesh), so its size (nrows and ncols) grows with n^3, where n is the size of the domain. Then there is a matrix B, which has the same number of rows but fewer columns, growing approximately with n^2. I use another library to factorize K into its Cholesky factors, K = L*Lt = Ut*U.

Now, what I need to do is solve the system LX=B. This is where the sparse trsm is used.

Since L is a Cholesky factor, it is not nearly as sparse as the original K; because of fill-in, it gets much denser near the bottom-right of the lower triangle.


I would also like to mention that I really dislike that SpSM requires the buffer to remain unmodified between analysis and solve, especially since the buffer is so large (approximately the size of the matrix plus the RHS, independent of opA/opB/orderB from my observations). A much better option for me would be multiple buffers: one persistent buffer, which must be left unmodified, and two temporary buffers that live only for the duration of the kernel itself, separate for analysis and solve (ideally with different temporary buffer sizes for analysis and for solve).
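To make the request concrete, the buffer scheme I have in mind would look something like the following. These functions do NOT exist in cuSPARSE; the names and signatures are made up purely to illustrate the proposal.

```
// Hypothetical API shape -- illustration only, not real cuSPARSE calls.
// Three sizes instead of one:
cusparseSpSM_bufferSizes(handle, /* ...op/mat/alg args... */,
                         &persistentSize, &analysisTmpSize, &solveTmpSize);

cusparseSpSM_analysis(handle, /* ... */, dPersistent, dAnalysisTmp);
// dAnalysisTmp may be freed or reused here;
// only dPersistent must stay unmodified.

cusparseSpSM_solve(handle, /* ... */, dPersistent, dSolveTmp);
// dSolveTmp may be freed or reused between solves.
```

This way only the (hopefully small) persistent part has to be kept alive between calls, and the large temporary parts could share memory with the rest of the application.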

Hi @jhomola
Thanks for the reproducer and matrices. I will get back to you after doing more analysis using your matrices.

Hi @malmasri

Is there any update on this?

Hi @jhomola,
We conducted a preliminary evaluation and identified potential improvements to cusparseSpSV. I’ll keep you informed of any updates.

Thanks,
