I have about 400,000,000 linear systems of equations Ax=b, in which every A is of size 4096×3 and every b is of size 4096×1.
I have assembled them into one large sparse system, SX=B, where S is the sparse coefficient matrix with the A blocks placed along its diagonal. The solution vector X and the right-hand-side vector B are formed by vertically stacking the x's and b's:
S = [
A1 0 ... 0
0 A2 0 ... 0
...
0 ... 0 AN
]
B = [
b1
b2
...
bN
]
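As far as I understand, because S is block diagonal, its pseudoinverse is block diagonal too, so the stacked system is equivalent to the N original least-squares problems. A short sketch of that step, where A_i^+ (the Moore-Penrose pseudoinverse of A_i) and the thin SVD factors U_i, Σ_i, V_i are notation I am introducing here:

```latex
S^{+} =
\begin{bmatrix}
A_1^{+} &        &         \\
        & \ddots &         \\
        &        & A_N^{+}
\end{bmatrix},
\qquad
X = S^{+} B
\;\Longrightarrow\;
x_i = A_i^{+} b_i, \quad i = 1, \dots, N,
\qquad
A_i = U_i \Sigma_i V_i^{\mathsf{T}}
\;\Longrightarrow\;
A_i^{+} = V_i \Sigma_i^{+} U_i^{\mathsf{T}}.
```

Here each A_i^+ is 3×4096, so S^+ is 3N×4096N and every x_i depends only on its own A_i and b_i.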
I need to solve for X using the SVD/pseudoinverse.
Here is what I have understood from my research on how to solve the newly formed SX=B using the SVD/pseudoinverse:

cuSolver doesn't allow the system to be solved with the SVD when the matrix is in sparse format.
cuSolver only offers QR or LU factorization for the sparse format.
cuSolver can compute an approximate SVD over a large batch of matrices, which can then be used to solve the systems of equations.
I have implemented the batched approximate SVD and tested it on 10,000 matrices of size 4096×3, which took 3.6 s on my laptop. At that rate, solving all 400,000,000 systems of equations would take about 40 hours. The CPU version, which uses Intel MKL to solve 10,000 systems Ax=b sequentially, takes 10,000 × 60 µs = 600 ms, so the GPU is 6 times slower.
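To make the timing estimate easy to check, here is a small sketch of the arithmetic in Python. It uses only the figures quoted in this post and assumes the run time scales linearly with the number of systems, which I have not verified at full scale; the variable names are just for illustration.

```python
# Figures quoted in the post.
total_systems = 400_000_000
batch_size = 10_000
gpu_batch_seconds = 3.6          # batched approximate SVD, one batch of 10,000 systems
cpu_per_system_seconds = 60e-6   # Intel MKL, one system solved sequentially

# Assumption: cost grows linearly with the number of batches / systems.
batches = total_systems // batch_size
gpu_total_hours = batches * gpu_batch_seconds / 3600

cpu_batch_seconds = batch_size * cpu_per_system_seconds
cpu_total_hours = total_systems * cpu_per_system_seconds / 3600

# How much slower the GPU batch is than the CPU on the same 10,000 systems.
slowdown = gpu_batch_seconds / cpu_batch_seconds

print(f"batches:            {batches}")
print(f"GPU, all systems:   {gpu_total_hours:.1f} h")
print(f"CPU, 10,000 systems: {cpu_batch_seconds:.2f} s")
print(f"CPU, all systems:   {cpu_total_hours:.1f} h")
print(f"GPU / CPU per batch: {slowdown:.1f}x")
```

Under that linear-scaling assumption, the GPU path comes to about 40 hours for all systems, against roughly 6.7 hours for the sequential CPU path.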
I would appreciate the community's feedback on this problem. How can I beat the CPU speed by taking advantage of CUDA?