I have an application where I need to solve many power flow analysis which require solving many independent sparse linear systems. Each linear system have the same size and structure but different values in both the A matrix and B vector (let say we are solving A*X = B for X). Their size range from 30x30 to 3000x3000 with about 60 to 6000 non-zero elements mostly on the diagonal. I would like to use CUDA cuSPARSE to solve those linear systems on the GPU. If I had source code, I would use one thread block per linear system and my application would require 1000 threadblocks or so (for 1000 independent systems). The problem is that cusparse works at a higher level. I was thinking of two solutions:
- solve each system sequentially on the GPU using different streams, hoping to get concurrent execution, but I doubt the performance will be good.
- merge all my systems in a single large matrix and solve this matrix. This would create a very large matrix with all the original matrices positioned as blocks on the diagonal. Woud the performance be good?
Does anyone has a suggestion. Maybe there is another alternative. Maybe there are source code I could use and solve my system at the blockthread level? Any suggestions is appreciated. Thanks in advance.
cuSparse(as far as I can tell) only has the A*X=B solver for sparse triangular matrices. There are two steps, one analysis and one solve.
[url]cuSPARSE :: CUDA Toolkit Documentation
As far as item number #2, this is something I am working on but for matrix-vector multiplications.
I am replicating matrices of the same size on the diagonal, and representing that block-diagonal matrix as sparse. Then I perform the matrix-vector multiply.
While I am using cuSparse more now, it is not easy to master.
Your matrices are small enough that you should consider cuBlas or even MAGMA.
MAGMA will be the best choice and but most solvers require you do a LU decomp or Cholesky factor first.
[url]http://icl.cs.utk.edu/projectsfiles/magma/docs/magma-v02.pdf[/url]
As far as streams go, they work but you need a high-end GPU and if the independent solves are not small, then it will not be able to parallelize the entire group.
Thank you, I really appreciate your comment. I will try both approaches and comments on the performance here.
Out of necessity I wrote a helper kernel which takes in a small dense matrix, and replicates across the diagonal in CSR sparse format.
Initially I tried creating a large block diag dense matrix(from the small matrices) then converting to CSR format using the cuSparse utilities. The problem was that the conversion process took too long so I wrote my own implementation which is much faster (but also takes advantage of the fact that I know the number of non-zeros ahead of time).
Let me know if you have any interest in that implementation. Really cuSparse is meant to handle very large but sparse data sets, so you probably will be better off using MAGMA, CULA or cuBlas.