Solving a Linear System of Equation with Very Large Sparse Coefficient Matrix Using SVD

I have about 400,000,000 linear system of equations Ax=b in which all As are of size 4096×3 and bs are of size 4096×1.

I have constructed a large sparse matrix, SX=B, where S is the sparse coefficient matrix consisting of many As (they are populated at the diagonal of S) . The solution vector X and right hand side vector B are formed by vertically stacking xs and bs.

S = [ 
    A1  0    ... 0
    0   A2 0 ... 0
    ...
    0 ...     0 AN
]

B = [
   b1
   b2 
   ...
   bN
]

I need to solve for solution using SVD/pseudoinverse.

Here are my understandings from my research on how to solve the newly formed SX=B using SVD/pseudoinverse:

  1. cuSolver doesn’t allow the matrices to be solved in a sparse format.

  2. cuSolver only allows the QR or LU factorization in the sparse format.

  3. cuSolver allows to find the approximate SVD over a large batch of matrices, which can be then used to solve the system of equations.

I have implemented the batched approximate SVD and tested it on the 10,000 matrices of size 4096 x 3. And this took 3.6 s on my laptop. This means to solve all 400,000,000 system of equations, it will take 40 hours. The CPU version of this using Intel MKL to solve 10,000 systems of Ax=b in a sequential manner takes 10,000×60us=60ms (GPU is 6 times slower).

I would appreciate it if I could have the community’s feedback for this problem. How can I beat CPU speed by taking advantage of Cuda?