Increased number of PCG iterations with cusparseDbsrsv2_solve

Hi guys,

I'm trying to solve a system of linear equations Ax = b with a sparse 3x3 block matrix (<= 1% non-zero blocks), stored in BSR format. The matrix is symmetric and positive definite, with 5k to 30k rows.
I solve it with the conjugate gradient method from Saad's book Iterative Methods for Sparse Linear Systems http://www.lmn.pub.ro/~daniel/ElectromagneticModelingDoctoral/Books/Numerical%20Methods/SaadIterativeMethods.pdf (Algorithm 7.6, page 263), combined with an ILU(2) preconditioner (Section 10.3.3, page 331 of the same book).
The solver and the preconditioner work well in my parallel CPU implementation, but the bottleneck is applying the preconditioner, i.e. solving the two triangular systems with the lower and upper triangular factors.
To speed this up I turned to the cuSPARSE library. Since ILU(2) is not implemented in cuSPARSE, I implemented the PCG method for the GPU myself using cusparseDbsrsv2_solve, cublasDdot, cublasDaxpy, etc. However, this solution requires many more iterations to reach the same accuracy as the CPU-based implementation, which seems strange to me since it uses the same algorithm, and I think this is why the GPU solution runs slower.
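
In case it helps, the structure of my GPU loop is roughly the following. This is a trimmed-down sketch, not the exact code from the gist: the descriptor, analysis, and buffer setup for the triangular solves is done beforehand, error checking is removed, and all the names are mine.

#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <cusparse.h>

/* Sketch of a GPU PCG loop.  A is the system matrix in BSR; M = L*U are the
   ILU factors stored together in a second BSR matrix.  descrL/infoL/bufL and
   descrU/infoU/bufU must already be set up (fill mode lower + unit diagonal
   for L, upper + non-unit for U) and cusparseDbsrsv2_analysis must have been
   run for both.  CUSPARSE_DIRECTION_COLUMN assumes column-major 3x3 blocks;
   use CUSPARSE_DIRECTION_ROW if yours are row-major.  Work vectors are
   allocated by the caller.  Error checking is omitted to keep it short. */
int pcg_gpu(cusparseHandle_t sp, cublasHandle_t bl,
            int mb, int nnzbA, int nnzbM, int blockDim,
            cusparseMatDescr_t descrA, const double *valA,
            const int *rowA, const int *colA,
            cusparseMatDescr_t descrL, bsrsv2Info_t infoL, void *bufL,
            cusparseMatDescr_t descrU, bsrsv2Info_t infoU, void *bufU,
            const double *valM, const int *rowM, const int *colM,
            const double *b, double *x,                 /* device vectors */
            double *r, double *z, double *p, double *q, double *t,
            double tol, int maxit)
{
    const double one = 1.0, zero = 0.0, minus_one = -1.0;
    const int n = mb * blockDim;
    double rz, rz_new, pq, alpha, beta, rnorm, bnorm;
    int it;

    /* r = b - A*x */
    cublasDcopy(bl, n, b, 1, r, 1);
    cusparseDbsrmv(sp, CUSPARSE_DIRECTION_COLUMN, CUSPARSE_OPERATION_NON_TRANSPOSE,
                   mb, mb, nnzbA, &minus_one, descrA, valA, rowA, colA, blockDim,
                   x, &one, r);
    cublasDnrm2(bl, n, b, 1, &bnorm);

    /* z = M^{-1} r :  L*t = r, then U*z = t */
    cusparseDbsrsv2_solve(sp, CUSPARSE_DIRECTION_COLUMN, CUSPARSE_OPERATION_NON_TRANSPOSE,
                          mb, nnzbM, &one, descrL, valM, rowM, colM, blockDim,
                          infoL, r, t, CUSPARSE_SOLVE_POLICY_NO_LEVEL, bufL);
    cusparseDbsrsv2_solve(sp, CUSPARSE_DIRECTION_COLUMN, CUSPARSE_OPERATION_NON_TRANSPOSE,
                          mb, nnzbM, &one, descrU, valM, rowM, colM, blockDim,
                          infoU, t, z, CUSPARSE_SOLVE_POLICY_USE_LEVEL, bufU);
    cublasDcopy(bl, n, z, 1, p, 1);
    cublasDdot(bl, n, r, 1, z, 1, &rz);

    for (it = 0; it < maxit; ++it) {
        cublasDnrm2(bl, n, r, 1, &rnorm);
        if (rnorm <= tol * bnorm) break;

        /* q = A*p,  alpha = (r,z)/(p,q) */
        cusparseDbsrmv(sp, CUSPARSE_DIRECTION_COLUMN, CUSPARSE_OPERATION_NON_TRANSPOSE,
                       mb, mb, nnzbA, &one, descrA, valA, rowA, colA, blockDim,
                       p, &zero, q);
        cublasDdot(bl, n, p, 1, q, 1, &pq);
        alpha = rz / pq;

        /* x += alpha*p,  r -= alpha*q */
        cublasDaxpy(bl, n, &alpha, p, 1, x, 1);
        alpha = -alpha;
        cublasDaxpy(bl, n, &alpha, q, 1, r, 1);

        /* z = M^{-1} r again */
        cusparseDbsrsv2_solve(sp, CUSPARSE_DIRECTION_COLUMN, CUSPARSE_OPERATION_NON_TRANSPOSE,
                              mb, nnzbM, &one, descrL, valM, rowM, colM, blockDim,
                              infoL, r, t, CUSPARSE_SOLVE_POLICY_NO_LEVEL, bufL);
        cusparseDbsrsv2_solve(sp, CUSPARSE_DIRECTION_COLUMN, CUSPARSE_OPERATION_NON_TRANSPOSE,
                              mb, nnzbM, &one, descrU, valM, rowM, colM, blockDim,
                              infoU, t, z, CUSPARSE_SOLVE_POLICY_USE_LEVEL, bufU);

        /* beta = (r_new, z_new) / (r_old, z_old),  p = z + beta*p */
        cublasDdot(bl, n, r, 1, z, 1, &rz_new);
        beta = rz_new / rz;
        rz = rz_new;
        cublasDscal(bl, n, &beta, p, 1);
        cublasDaxpy(bl, n, &one, z, 1, p, 1);
    }
    return it;   /* number of iterations performed */
}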

So guys, my questions are:

  • How exactly does the triangular solve in cusparseDbsrsv2_solve work? Do you have any links I could dig into?
  • Why do I get so many more iterations of the preconditioned conjugate gradient method on the GPU?
  • Please find my examples here: https://gist.github.com/maiorpa/d2b853f87a0c4d58163ba4de8e06bac8

Thank you, I appreciate your time and help. Feel free to just point me to a source.

Best,
Pavel

P.S. Sorry for any inconvenience or mistakes, it's my first post.

Has anyone faced a similar problem? Should I clarify my question? Feel free to contact me if anything needs to be more specific.

up

Hi maiorpa,

I'm trying to use cusparseDbsrsv2_solve in a similar way and I'm having trouble understanding how to use this function properly. I generate the ILU(0), ILU(1), or ILU(2) factors of 3x3 blocked matrices on the CPU, and I want to use cusparseDbsrsv2_solve to apply those factors inside a Krylov iterative solver. I haven't managed to make it work with my matrices, despite trying several experiments. Did you get any answer about how cusparseDbsrsv2_solve works?

I can be more descriptive about my experiments if you're interested.
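
For example, here is a simplified version of what I've been trying for the lower factor. The descriptor settings are my reading of the cuSPARSE documentation and all the names are mine; the factor arrays are assumed to be already copied to the device.

#include <cusparse.h>
#include <cuda_runtime.h>

/* Solve L*y = r, where L is the unit lower triangular ILU factor built on
   the CPU and copied to the device as BSR arrays (d_valL, d_rowPtrL,
   d_colIndL).  mb = number of block rows, nnzbL = number of non-zero blocks.
   CUSPARSE_DIRECTION_COLUMN assumes column-major blocks; adjust if needed.
   Error checking is omitted for brevity. */
void solve_lower(cusparseHandle_t handle, int mb, int nnzbL, int blockDim,
                 double *d_valL, const int *d_rowPtrL, const int *d_colIndL,
                 const double *d_r, double *d_y)
{
    const double one = 1.0;
    cusparseMatDescr_t descrL;
    bsrsv2Info_t infoL;
    int bufSize = 0;
    void *dBuf = NULL;

    cusparseCreateMatDescr(&descrL);
    cusparseSetMatIndexBase(descrL, CUSPARSE_INDEX_BASE_ZERO);
    cusparseSetMatType(descrL, CUSPARSE_MATRIX_TYPE_GENERAL);
    cusparseSetMatFillMode(descrL, CUSPARSE_FILL_MODE_LOWER);
    cusparseSetMatDiagType(descrL, CUSPARSE_DIAG_TYPE_UNIT);

    cusparseCreateBsrsv2Info(&infoL);
    cusparseDbsrsv2_bufferSize(handle, CUSPARSE_DIRECTION_COLUMN,
                               CUSPARSE_OPERATION_NON_TRANSPOSE,
                               mb, nnzbL, descrL, d_valL, d_rowPtrL, d_colIndL,
                               blockDim, infoL, &bufSize);
    cudaMalloc(&dBuf, bufSize);

    cusparseDbsrsv2_analysis(handle, CUSPARSE_DIRECTION_COLUMN,
                             CUSPARSE_OPERATION_NON_TRANSPOSE,
                             mb, nnzbL, descrL, d_valL, d_rowPtrL, d_colIndL,
                             blockDim, infoL, CUSPARSE_SOLVE_POLICY_NO_LEVEL, dBuf);

    cusparseDbsrsv2_solve(handle, CUSPARSE_DIRECTION_COLUMN,
                          CUSPARSE_OPERATION_NON_TRANSPOSE,
                          mb, nnzbL, &one, descrL, d_valL, d_rowPtrL, d_colIndL,
                          blockDim, infoL, d_r, d_y,
                          CUSPARSE_SOLVE_POLICY_NO_LEVEL, dBuf);

    cudaFree(dBuf);
    cusparseDestroyBsrsv2Info(infoL);
    cusparseDestroyMatDescr(descrL);
}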

Hi maiorpa, as I mentioned I had some trouble with cusparseDbsrsv2_solve myself, and I have now figured out what cusparseDbsrsv2_solve actually does. I think the cuSPARSE documentation should be improved to avoid this kind of misunderstanding. So here is the reason for the behavior you are seeing.

If you pass cuSPARSE a blocked matrix and call cusparseDbsrilu02 to construct a "blocked" ILU(0) decomposition of it, what you actually get is a scalar ILU(0) decomposition represented in blocked storage, i.e. the scalar ILU(0) factorization of the blocked matrix; that is the first fact. The second fact is that cusparseDbsrsv2_solve runs a scalar triangular solve, not a blocked one, so if you compute a blocked ILU(0) decomposition on the CPU and try to use cusparseDbsrsv2_solve to solve with it on the GPU, you will not get a correct solution.
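
To make that concrete, here is roughly the in-place factorization pattern from the cuSPARSE documentation example for cusparseDbsrilu02 (a sketch, error checks trimmed, names are mine). The bsrVal array it overwrites is what the two subsequent cusparseDbsrsv2_solve calls expect: a unit-diagonal lower descriptor for L and a non-unit upper descriptor for U, over the same BSR arrays.

#include <cusparse.h>
#include <cuda_runtime.h>

/* In-place ILU(0) of a BSR matrix with cusparseDbsrilu02.  On return,
   d_bsrVal holds both factors: strictly lower part plus an implied unit
   diagonal is L, upper part including the diagonal is U.  These arrays are
   then handed to cusparseDbsrsv2_solve twice (FILL_MODE_LOWER/DIAG_TYPE_UNIT
   for L, FILL_MODE_UPPER/DIAG_TYPE_NON_UNIT for U). */
void factorize_ilu0(cusparseHandle_t handle, int mb, int nnzb, int blockDim,
                    cusparseMatDescr_t descrA,          /* GENERAL, base 0 */
                    double *d_bsrVal, const int *d_bsrRowPtr,
                    const int *d_bsrColInd)
{
    bsrilu02Info_t info;
    int bufSize = 0, structural_zero = 0, numerical_zero = 0;
    void *dBuf = NULL;

    cusparseCreateBsrilu02Info(&info);
    cusparseDbsrilu02_bufferSize(handle, CUSPARSE_DIRECTION_COLUMN, mb, nnzb,
                                 descrA, d_bsrVal, d_bsrRowPtr, d_bsrColInd,
                                 blockDim, info, &bufSize);
    cudaMalloc(&dBuf, bufSize);

    cusparseDbsrilu02_analysis(handle, CUSPARSE_DIRECTION_COLUMN, mb, nnzb,
                               descrA, d_bsrVal, d_bsrRowPtr, d_bsrColInd,
                               blockDim, info, CUSPARSE_SOLVE_POLICY_USE_LEVEL,
                               dBuf);
    if (cusparseXbsrilu02_zeroPivot(handle, info, &structural_zero)
            == CUSPARSE_STATUS_ZERO_PIVOT) {
        /* diagonal block (structural_zero, structural_zero) is missing */
    }

    cusparseDbsrilu02(handle, CUSPARSE_DIRECTION_COLUMN, mb, nnzb,
                      descrA, d_bsrVal, d_bsrRowPtr, d_bsrColInd,
                      blockDim, info, CUSPARSE_SOLVE_POLICY_USE_LEVEL, dBuf);
    if (cusparseXbsrilu02_zeroPivot(handle, info, &numerical_zero)
            == CUSPARSE_STATUS_ZERO_PIVOT) {
        /* diagonal block (numerical_zero, numerical_zero) is not invertible */
    }

    cudaFree(dBuf);
    cusparseDestroyBsrilu02Info(info);
}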

The reason your blocked version takes more iterations to converge on the GPU with cuSPARSE is that the ILU(0) preconditioner generated by cusparseDbsrilu02 (if you are using that function) may perform worse (this is true for some matrices) than a real blocked ILU(0) preconditioner, which may be what you are actually computing in your CPU implementation.