I have an application where I need to solve many power flow analysis which require solving many independent sparse linear systems. Each linear system have the same size and structure but different values in both the A matrix and B vector (let say we are solving A*X = B for X). Their size range from 30x30 to 3000x3000 with about 60 to 6000 non-zero elements mostly on the diagonal. I would like to use CUDA cuSPARSE to solve those linear systems on the GPU. If I had source code, I would use one thread block per linear system and my application would require 1000 threadblocks or so (for 1000 independent systems). The problem is that cusparse works at a higher level. I was thinking of two solutions:

- solve each system sequentially on the GPU using different streams, hoping to get concurrent execution, but I doubt the performance will be good.
- merge all my systems in a single large matrix and solve this matrix. This would create a very large matrix with all the original matrices positioned as blocks on the diagonal. Woud the performance be good?

Does anyone has a suggestion. Maybe there is another alternative. Maybe there are source code I could use and solve my system at the blockthread level? Any suggestions is appreciated. Thanks in advance.