cuSPARSE generic SpSM much slower than legacy csrsm2

Hi,

In my application, the sparse triangular solve with a right-hand-side matrix (spsm, sparse trsm) is the most important kernel. I found that the new generic cuSPARSE interface (SpSM, link) is much slower than the now-deprecated legacy API (csrsm2, link to 11.7.1 version).

In the following zip, you can find the example program source code, a Makefile, and four different matrices used in my use case. I think the code should be clear to people who know cuSPARSE, so it needs no explanation. Just compile and run using make run.

cusparse_trsm_comparison.zip (55.3 MB) (unzips to ~160MB)
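For reference, the generic-API call sequence I am timing looks roughly like this. This is a sketch, not the exact code from the zip: double precision, 32-bit CSR indices, column-major RHS, and error checking omitted; `n`, `nrhs`, `nnz` and the device pointers are placeholders.

```cuda
#include <cuda_runtime.h>
#include <cusparse.h>
#include <cstdint>

// Solve L * X = B with the generic SpSM API (sketch, no error checking).
void spsm_generic(cusparseHandle_t handle,
                  int n, int nrhs, int64_t nnz,
                  int *d_csrRowPtr, int *d_csrColInd, double *d_csrVal,
                  double *d_B, double *d_X)
{
    double alpha = 1.0;
    cusparseSpMatDescr_t matL;
    cusparseDnMatDescr_t matB, matX;
    cusparseSpSMDescr_t  spsmDescr;

    cusparseCreateCsr(&matL, n, n, nnz, d_csrRowPtr, d_csrColInd, d_csrVal,
                      CUSPARSE_INDEX_32I, CUSPARSE_INDEX_32I,
                      CUSPARSE_INDEX_BASE_ZERO, CUDA_R_64F);
    // Declare L as lower triangular with a non-unit diagonal
    cusparseFillMode_t fill = CUSPARSE_FILL_MODE_LOWER;
    cusparseDiagType_t diag = CUSPARSE_DIAG_TYPE_NON_UNIT;
    cusparseSpMatSetAttribute(matL, CUSPARSE_SPMAT_FILL_MODE, &fill, sizeof(fill));
    cusparseSpMatSetAttribute(matL, CUSPARSE_SPMAT_DIAG_TYPE, &diag, sizeof(diag));

    cusparseCreateDnMat(&matB, n, nrhs, n, d_B, CUDA_R_64F, CUSPARSE_ORDER_COL);
    cusparseCreateDnMat(&matX, n, nrhs, n, d_X, CUDA_R_64F, CUSPARSE_ORDER_COL);

    cusparseSpSM_createDescr(&spsmDescr);

    size_t bufferSize = 0;
    void  *d_buffer   = nullptr;
    cusparseSpSM_bufferSize(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                            CUSPARSE_OPERATION_NON_TRANSPOSE, &alpha,
                            matL, matB, matX, CUDA_R_64F,
                            CUSPARSE_SPSM_ALG_DEFAULT, spsmDescr, &bufferSize);
    cudaMalloc(&d_buffer, bufferSize);

    cusparseSpSM_analysis(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                          CUSPARSE_OPERATION_NON_TRANSPOSE, &alpha,
                          matL, matB, matX, CUDA_R_64F,
                          CUSPARSE_SPSM_ALG_DEFAULT, spsmDescr, d_buffer);

    // Only this call is timed; analysis is done once up front.
    // Note: solve takes no buffer argument -- it reuses d_buffer internally,
    // which must stay allocated and unmodified since the analysis call.
    cusparseSpSM_solve(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                       CUSPARSE_OPERATION_NON_TRANSPOSE, &alpha,
                       matL, matB, matX, CUDA_R_64F,
                       CUSPARSE_SPSM_ALG_DEFAULT, spsmDescr);

    cudaFree(d_buffer);
    cusparseSpSM_destroyDescr(spsmDescr);
    cusparseDestroyDnMat(matX);
    cusparseDestroyDnMat(matB);
    cusparseDestroySpMat(matL);
}
```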

On the A100-40GB machine I am using, with CUDA/11.7.0, this is the output I am observing:

System matrix13.txt, size=2744, nrhs=984, nnz=249065:
  Legacy:      1.294336 ms
  Generic:     7.670579 ms

System matrix16.txt, size=4913, nrhs=1450, nnz=593851:
  Legacy:      3.538227 ms
  Generic:    19.877888 ms

System matrix20.txt, size=9261, nrhs=2210, nnz=1468975:
  Legacy:     12.004761 ms
  Generic:    86.814720 ms

System matrix25.txt, size=17576, nrhs=3386, nnz=3605929:
  Legacy:     42.664242 ms
  Generic:   273.142517 ms

I really did not expect the performance to degrade with a newer cuSPARSE API, and definitely not by this much.

The performance of SpSM is so poor for my use case that it is faster to convert the sparse system matrix to dense and use the dense trsm from cuBLAS.

Could you please look into this? Why did the performance degrade so much? My guess is that it uses a different algorithm (I use algo=1 in the legacy version); if that's the case, why is that algorithm not available in the generic version?
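For comparison, the legacy call sequence I benchmark against, with the algo parameter the generic API no longer exposes, looks roughly like this (again a sketch with placeholder sizes and pointers, error checking omitted; note that csrsm2 solves in place, overwriting B):

```cuda
#include <cuda_runtime.h>
#include <cusparse.h>

// Solve L * X = B with the legacy csrsm2 API, algo = 1 (sketch).
void spsm_legacy(cusparseHandle_t handle,
                 int m, int nrhs, int nnz,
                 const double *d_csrVal, const int *d_csrRowPtr,
                 const int *d_csrColInd,
                 double *d_B /* in/out, leading dimension m */)
{
    const double alpha = 1.0;
    const int    algo  = 1;  // the algorithm that performs well for me

    cusparseMatDescr_t descrL;
    cusparseCreateMatDescr(&descrL);
    cusparseSetMatIndexBase(descrL, CUSPARSE_INDEX_BASE_ZERO);
    cusparseSetMatFillMode(descrL, CUSPARSE_FILL_MODE_LOWER);
    cusparseSetMatDiagType(descrL, CUSPARSE_DIAG_TYPE_NON_UNIT);

    csrsm2Info_t info;
    cusparseCreateCsrsm2Info(&info);

    size_t bufferSize = 0;
    cusparseDcsrsm2_bufferSizeExt(handle, algo,
                                  CUSPARSE_OPERATION_NON_TRANSPOSE,
                                  CUSPARSE_OPERATION_NON_TRANSPOSE,
                                  m, nrhs, nnz, &alpha, descrL,
                                  d_csrVal, d_csrRowPtr, d_csrColInd,
                                  d_B, m, info,
                                  CUSPARSE_SOLVE_POLICY_NO_LEVEL, &bufferSize);
    void *d_buffer = nullptr;
    cudaMalloc(&d_buffer, bufferSize);

    cusparseDcsrsm2_analysis(handle, algo,
                             CUSPARSE_OPERATION_NON_TRANSPOSE,
                             CUSPARSE_OPERATION_NON_TRANSPOSE,
                             m, nrhs, nnz, &alpha, descrL,
                             d_csrVal, d_csrRowPtr, d_csrColInd,
                             d_B, m, info,
                             CUSPARSE_SOLVE_POLICY_NO_LEVEL, d_buffer);

    // Only this call is timed; B is overwritten with the solution X.
    cusparseDcsrsm2_solve(handle, algo,
                          CUSPARSE_OPERATION_NON_TRANSPOSE,
                          CUSPARSE_OPERATION_NON_TRANSPOSE,
                          m, nrhs, nnz, &alpha, descrL,
                          d_csrVal, d_csrRowPtr, d_csrColInd,
                          d_B, m, info,
                          CUSPARSE_SOLVE_POLICY_NO_LEVEL, d_buffer);

    cudaFree(d_buffer);
    cusparseDestroyCsrsm2Info(info);
    cusparseDestroyMatDescr(descrL);
}
```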

Thanks,

Jakub


A little background on the matrices. I use this in a FETI solver. There we have a FEM-like sparse stiffness matrix K, which represents a 3D mesh: each row and column corresponds to a single degree of freedom (~= a node in the mesh), so its size (nrows and ncols) grows with n^3, where n is the size of the domain. Then there is a matrix B, which has the same number of rows but fewer columns, growing approximately with n^2. I use another library to factorize K into its Cholesky factors, K = L*Lt = Ut*U.

Now, what I need to do is solve the system LX=B. This is where the sparse trsm is used.

Since L is a Cholesky factor, it is not nearly as sparse as the original K; because of fill-in, it gets much denser near the bottom-right of the lower triangle.


I would also like to mention that I really dislike that SpSM requires the buffer to remain unmodified between analysis and solve, especially since the buffer is so large (approximately the size of the matrix plus the RHS, independent of opA/opB/orderB from my observations). A much better option for me would be multiple buffers: one persistent buffer, which must be left unmodified, and two temporary buffers that live only for the duration of the kernel itself, separate for analysis and solve (ideally with different temporary buffer sizes for analysis and for solve).
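To make the request concrete, the buffer scheme I have in mind would look something like the following. These functions do NOT exist in cuSPARSE; the names and signatures are made up purely to illustrate the proposal.

```
// Hypothetical API shape -- illustration only, not real cuSPARSE calls.
// Three sizes instead of one:
cusparseSpSM_bufferSizes(handle, /* ...op/mat/alg args... */,
                         &persistentSize, &analysisTmpSize, &solveTmpSize);

cusparseSpSM_analysis(handle, /* ... */, dPersistent, dAnalysisTmp);
// dAnalysisTmp may be freed or reused here;
// only dPersistent must stay unmodified.

cusparseSpSM_solve(handle, /* ... */, dPersistent, dSolveTmp);
// dSolveTmp may be freed or reused between solves.
```

This way only the (hopefully small) persistent part has to be kept alive between calls, and the large temporary parts could share memory with the rest of the application.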

Hi @jhomola
Thanks for the reproducer and matrices. I will get back to you after doing more analysis using your matrices.

Hi @malmasri

Is there any update on this?

Hi @jhomola,
We conducted a preliminary evaluation and identified potential improvements to cusparseSpSV. I’ll keep you informed of any updates.

Thanks,
